Policy Recognition in the Abstract Hidden Markov Model



In this paper, we present a method for recognising an agent's behaviour in dynamic, 
noisy, uncertain domains, and across multiple levels of abstraction. We term this problem 
on-line plan recognition under uncertainty and view it generally as probabilistic inference on 
the stochastic process representing the execution of the agent's plan. Our contributions in 
this paper are twofold. In terms of probabilistic inference, we introduce the Abstract Hidden 
Markov Model (AHMM), a novel type of stochastic process, provide its dynamic Bayesian 
network (DBN) structure and analyse the properties of this network. We then describe 
an application of the Rao-Blackwellised Particle Filter to the AHMM which allows us to 
construct an efficient, hybrid inference method for this model. In terms of plan recognition, 
we propose a novel plan recognition framework based on the AHMM as the plan execution 
model. The Rao-Blackwellised hybrid inference for AHMM can take advantage of the 
independence properties inherent in a model of plan execution, leading to an algorithm for 
online probabilistic plan recognition that scales well with the number of levels in the plan 
hierarchy. This illustrates that while stochastic models for plan execution can be complex, 
they exhibit special structures which, if exploited, can lead to efficient plan recognition 
algorithms. We demonstrate the usefulness of the AHMM framework via a behaviour 
recognition system in a complex spatial environment using distributed video surveillance 
data. 

1. Introduction 

Plan recognition is the problem of inferring an actor's plan by watching the actor's actions 
and their effects. Often, the actor's behaviour follows a hierarchical plan structure. Thus, 
in plan recognition, the observer needs to infer about the actor's plans and sub-plans at 
different levels of abstraction in its plan hierarchy. The problem is complicated by the two 
sources of uncertainty inherent in the actor's planning process: (1) the stochastic aspect of 
plan refinement (a plan can be non-deterministically refined into different sub-plans), and 
(2) the stochastic outcomes of actions (the same action can non-deterministically result in 
different outcomes). Furthermore, the observer has to deal with a third source of uncertainty 
arising from the noise and inaccuracy in its own observation about the actor's plan. In 
addition, we would like our observer to be able to perform the plan recognition task "on- 
line" while the observations about the actor's plan are streaming in. We refer to this general 
problem as on-line plan recognition under uncertainty. 

The seminal work in plan recognition (Kautz & Allen, 1986) considers a plan hierarchy, 
but does not deal with the uncertainty aspects of the problem. As a result, the approach 
can only postulate a set of possible plans for the actor, but is unable to determine which 
plan is more probable. Since then, the important role of uncertainty reasoning in plan 
recognition has been recognised (Charniak & Goldman, 1993; Bauer, 1994; van Beek, 1996), 
and Bayesian probability has been argued to be the appropriate model (Charniak & Goldman, 
1993; van Beek, 1996). The dynamic, "on-line" aspect of plan recognition has only been 
recently considered (Pynadath & Wellman, 1995, 2000; Goldman, Geib, & Miller, 1999; 
Huber, Durfee, & Wellman, 1994; Albrecht, Zukerman, & Nicholson, 1998). All of this 
recent work shares the view that online plan recognition is largely a problem of probabilistic 
inference in a stochastic process that models the execution of the actor's plan. While this 
view offers a general and coherent framework for modelling different sources of uncertainty, 
the stochastic process that we need to deal with can become quite complex, especially if we 
consider a large plan hierarchy. Thus, the main issue here is the computational complexity 
of dealing with this type of stochastic process, and whether the approach scales to 
more complex plan hierarchies. 

1.1 Aim and Significance 

In this paper, we demonstrate that the type of plan recognition problems described above 
scales reasonably well with respect to the number of levels of abstraction in the plan hi- 
erarchy. This is in contrast to the common-sense analysis that more levels in the plan 
hierarchy would introduce more variables to the stochastic process, which in turn, results 
in exponential complexity w.r.t the number of levels in the hierarchy. 

In order to achieve this, we first assume a general stochastic model of plan execution 
that can model the three sources of uncertainty involved. The model for planning with 
a hierarchy of abstraction under uncertainty has been developed recently by the abstract 
probabilistic planning community (Sutton, Precup, & Singh, 1999; Parr & Russell, 1997; 
Forestier & Varaiya, 1978; Hauskrecht, Meuleau, Kaelbling, Dean, & Boutilier, 1998; Dean 
& Lin, 1995). To our advantage, we adopt their basic model, known as the abstract Markov 
policy (AMP),¹ as our model for plan execution. The AMP is an extension of a policy 
in Markov Decision Processes (MDP) that enables an abstract policy to invoke other more 
refined policies and so on down the policy hierarchy. Thus, the AMP is similar to a contin- 
gent plan that prescribes which sub-plan should be invoked at each applicable state of the 
world to achieve its intended goal, except that it can represent both the uncertainty in the 
plan refinement and in the outcomes of actions. Since an AMP can be described simply in 
terms of a state space and a Markov policy that selects among a set of other AMP's, using 
the AMP as the model for plan execution also helps us focus on the structure of the policy 
hierarchy. 

The execution of an AMP leads to a special stochastic process which we call the 
Abstract Markov Model (AMM). The noisy observation about the environment state (e.g., 
the effects of action) can then be modelled by making the state "hidden", similar to the 
hidden state in the Hidden Markov Models (Rabiner, 1989). The result is an interesting and 
novel stochastic process which we term the Abstract Hidden Markov Model. Intuitively, the 
AHMM models how an AMP causes the adoption of other policies and actions at different 
levels of abstraction, which in turn generate a sequence of states and observations. In the 
plan recognition task, an observer is given an AHMM corresponding to the actor's plan 
hierarchy, and is asked to infer the current policy being executed by the actor at all 
levels of the hierarchy, taking into account the sequence of observations currently available. 
This amounts to reversing the direction of causality in the AHMM, i.e., determining a set 
of policies that can explain the sequence of observations at hand. We shall refer to this 
problem as policy recognition. 

1. Also known as options, policies of Abstract Markov Decision Processes, or supervisor's policies. 

If we view the AHMM as a type of dynamic Bayesian network (Dean & Kanazawa, 1989; 
Nicholson & Brady, 1992), it is known that the complexity of this kind of inference in the 
DBN depends on the size of the representation of the so-called belief state, the conditional 
joint distribution of the variables in the DBN at time t given the observation sequence up 
to t (Boyen & Koller, 1998). Thus we can ask the following question: how does the policy 
hierarchy affect the size of the belief state representation of the corresponding AHMM? 

Generally, for a policy hierarchy with K levels, the belief state would have at least 
K variables and thus the size of their joint distribution would be $O(\exp(K))$. However, 
the AHMM has a specific network structure that exhibits certain conditional independence 
properties among its variables which can be exploited for efficiency. We first identify these 
useful independence properties in the AHMM and show that there is a compact representa- 
tion of the special belief state in the case where the state sequence can be correctly observed 
(full observability assumption) and the starting and ending time of each policy is known. 
Consequently, policy recognition in this case can be performed very efficiently by updating 
the AHMM compact belief state. This partial result, although too restricted to be useful by 
itself, leads to an important observation about the general belief state: although it cannot 
be represented compactly, it can be approximated efficiently by a collection of compact spe- 
cial belief states. This makes the inference problem in the AHMM particularly amenable 
to a technique called Rao-Blackwellisation (Casella & Robert, 1996) which allows us to 
construct hybrid inference methods that combine both exact inference and approximate 
sampling-based inference for greater efficiency. The application of Rao-Blackwellisation to 
the AHMM structure reduces the sampling space that we need to approximate to a space 
with fixed dimension that does not depend on K, ensuring that the hybrid inference 
algorithm scales well w.r.t. K. 

The contributions of the paper are thus twofold. In terms of stochastic processes and 
dynamic Bayesian networks, we introduce the AHMM, a novel type of stochastic process, 
provide its DBN structure and analyse the properties of this network. We present an appli- 
cation of the Rao-Blackwellised Particle Filter to the AHMM which results in an efficient 
hybrid inference method for this stochastic model. In terms of plan recognition, we propose 
a novel plan recognition framework based on probabilistic inference using the AHMM as the 
plan execution model. The complexity of the inference problem is addressed by applying 
a range of recently developed techniques in probabilistic reasoning to the plan recognition 
problem. Our work illustrates that while the stochastic models for plan execution can be 
complex, they exhibit certain special structures that can be exploited to construct efficient 
plan recognition algorithms. 

1.2 Structure of the Paper 

The main body of the paper is organised as follows. Section 2 introduces the background 
material in dynamic Bayesian networks and probabilistic inference. Section 3 formally de- 
fines the abstract Markov policy and the policy hierarchy. Section 4 presents the AHMM, 
its DBN representation and conditional independence properties. The algorithms for pol- 
icy recognition are discussed in Section 5, first for the special tractable case and then for 
the general case. Section 6 presents our experimental results with the AHMM framework, 
including a real-time system for recognising people's behaviour in a complex spatial 
environment using distributed video surveillance data. Section 7 provides a comparative review of 
related work in probabilistic plan recognition. Finally, we conclude and discuss directions 
for further research in Section 8. 

2. Background in Probabilistic Inference 

The aim of this section is to familiarise readers with some concepts in probabilistic inference 
that will be used later on in the paper. In subsections 2.1 and 2.2, we discuss Bayesian 
Networks (BN) and Dynamic Bayesian Networks (DBN) in general. In subsection 2.3, 
we discuss the Sequential Importance Sampling (SIS) algorithm, a general approximate 
sampling-based inference method for dynamic models. Subsections 2.4 and 2.5 introduce 
Rao-Blackwellisation, a technique for improving sampling-based methods by utilising certain 
special structures of the dynamic model. Later on, Rao-Blackwellisation will be used as our 
key computational technique for performing policy recognition. 

2.1 Bayesian Networks 

The Bayesian network (BN) (Pearl, 1988; Jensen, 1996; Castillo, Gutierrez, & Hadi, 1997) 
(also known as probabilistic network or belief network) is a well-established framework for 
dealing with uncertainty. It provides a graphical and compact representation of the joint 
probability distribution of a set of domain variables $X_1, \ldots, X_n$ in the form of a directed 
acyclic graph (DAG) whose nodes correspond to the domain variables. For each node 
$X_i$, the links from the parent nodes $Pa(X_i)$ are parameterised by the conditional 
probability of that node given the parents, $\Pr(X_i \mid Pa(X_i))$. The network structure together 
with the parameters encodes a factorisation of the joint probability distribution (JPD): 
$\Pr(X_1, \ldots, X_n) = \prod_{i=1}^{n} \Pr(X_i \mid Pa(X_i))$. Given a Bayesian network, conditional independence 
statements of the form $X \perp Y \mid Z$ ($X$ is independent of $Y$ given $Z$, where $X$, $Y$, $Z$ are 
variables or sets of variables) can be asserted if $X$ is d-separated from $Y$ by $Z$ in the network 
structure, where d-separation is a graph separation concept for DAGs (Pearl, 1988). The 
network structure of a BN thus captures certain conditional independence properties among 
the domain variables which can be exploited for efficient inference. 
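
As a minimal sketch of the factorisation above, the following Python fragment builds the joint distribution of a three-node network as a product of per-node conditionals and answers a query by brute-force summation. The network (Rain, Sprinkler, WetGrass), its probabilities, and the function names are illustrative assumptions and do not come from this paper.

```python
from itertools import product

# Hypothetical network: Rain -> Sprinkler, and (Rain, Sprinkler) -> WetGrass.
# All numbers below are illustrative assumptions, not taken from the paper.
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: {True: 0.01, False: 0.99},    # P_SPRINKLER[rain][sprinkler]
               False: {True: 0.4, False: 0.6}}
P_WET = {(True, True): 0.99, (True, False): 0.8,   # P(wet = True | rain, sprinkler)
         (False, True): 0.9, (False, False): 0.0}


def joint(rain, sprinkler, wet):
    """Joint probability as the product of each node's conditional given its parents."""
    p_wet = P_WET[(rain, sprinkler)]
    return (P_RAIN[rain]
            * P_SPRINKLER[rain][sprinkler]
            * (p_wet if wet else 1.0 - p_wet))


def posterior_rain_given_wet():
    """Pr(rain | wet) by brute-force summation over the unobserved variable."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    evidence = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numerator / evidence


# The factorised terms define a proper distribution: all eight entries sum to one.
total = sum(joint(r, s, w) for r, s, w in product((True, False), repeat=3))
```

Brute-force summation is exponential in the number of variables; it is used here only to make the factorisation concrete, not as a stand-in for the exact inference algorithms cited in the text.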

The main inference task on a Bayesian network is to calculate the conditional probability 
of a set of variables given the values of another set of variables (the evidence). There are 
two types of computation techniques for doing this. Exact inference algorithms (Lauritzen 
& Spiegelhalter, 1988; Jensen, Lauritzen, & Olesen, 1990; D'Ambrosio, 1993) compute 
the exact value of the conditional probability required based on analytical transformation 
that exploits the conditional independence relationships of the variables in the network. 
Approximate inference algorithms (Pearl, 1987; York, 1992; Henrion, 1988; Fung & Chang, 
1989; Shachter & Peot, 1989) compute only an approximation of the required probability, 
usually obtained either through "forward" sampling (Henrion, 1988; Fung & Chang, 1989; 
Shachter & Peot, 1989) (a variant of Bayesian Importance Sampling (Geweke, 1989)), or 
through Gibbs (Markov chain Monte Carlo) sampling (Pearl, 1987; York, 1992). These 
algorithms are simple to implement, can be applied to all types of 
network, and can trade off the accuracy of the estimates against computational resources. It is 
known that exact inference in BNs is NP-hard with respect to the network size (Cooper, 
1990), while approximate inference, although it scales well with the network size, is NP-hard 
with respect to the hard-bound accuracy of the estimates (Dagum & Luby, 1993). In the 
light of these theoretical results, approximate inference can be useful in large networks when 
exact computation is intractable, but a certain degree of error in the probability estimate 
can be tolerated by the application. 

2.2 Dynamic Bayesian Networks 

To model the temporal dynamics of the environment, the Dynamic Bayesian Network 
(DBN) (Dean & Kanazawa, 1989; Nicholson & Brady, 1992; Dagum, Galper, & Horvitz, 
1992) provides a special Bayesian network architecture for representing the evolution of the 
domain variables over time. A DBN consists of a sequence of time-slices where each time-slice 
contains a set of variables representing the state of the environment at the current time. 
A time-slice is in itself a Bayesian network, with the same network structure replicated at 
each time-slice. The temporal dynamics of the environment is encoded via the network links 
from one time-slice to the next. In addition, each time-slice can contain observation nodes 
which model the (possibly noisy) observation about the current state of the environment. 

Given a DBN and a sequence of observations, we might want to draw predictions 
about the future state variables (predicting), or about the unobserved variables in the 
past (smoothing) (Kjaerulff, 1992). This problem can be solved using an inference algo- 
rithm for Bayesian networks described above. However, if we want to revise the prediction 
as the observations arrive over time, reapplying the inference algorithm each time the 
observation sequence changes could be costly, especially as the sequence grows. To avoid this, 
we need to maintain the joint distribution of all the variables in the current time-slice, given 
the observation sequence up to the current time. This probability distribution is termed the belief state 
(also known as the filtering distribution) and plays an important role in inferencing in the 
DBN. All existing inference schemes for the DBN involve maintaining and updating the 
belief state (i.e., filtering). When a new observation is received, the current belief state is 
rolled over one time-slice ahead following the evolution model, then conditioned on the new 
observation to obtain the updated belief state. 
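
The roll-over and conditioning steps just described can be sketched for the simplest DBN, a discrete HMM with a single state variable per time-slice. The two-state model, its numbers, and the name `filter_step` are assumptions made for illustration only.

```python
def filter_step(belief, transition, likelihood):
    """One belief-state update: roll over one time-slice, then condition.

    belief[i]        -- current belief Pr(state = i | observations so far)
    transition[i][j] -- Pr(next state = j | state = i)
    likelihood[j]    -- Pr(new observation | next state = j)
    """
    n = len(belief)
    # Roll the belief state one time-slice ahead using the evolution model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    # Condition on the new observation and renormalise.
    unnormalised = [predicted[j] * likelihood[j] for j in range(n)]
    norm = sum(unnormalised)
    return [u / norm for u in unnormalised]


# Assumed two-state model; all numbers are illustrative.
TRANSITION = [[0.9, 0.1], [0.2, 0.8]]
EMISSION = [[0.8, 0.2], [0.3, 0.7]]   # EMISSION[state][observation]

belief = [0.5, 0.5]                   # belief before the first transition
for obs in [0, 0, 1]:
    belief = filter_step(belief, TRANSITION, [EMISSION[s][obs] for s in range(2)])
```

Each update costs time quadratic in the number of joint states of a time-slice, which is why the size of the belief state dominates the cost of filtering in larger DBNs.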

An obvious problem with this approach is the size of the belief state that we need to 
maintain. It has been noted that while the interaction of the variables in the DBN is 
localised, the variables in the belief state can be highly connected (Boyen & Koller, 1998). 
This is because the marginalisation of the past time-slices usually destroys the conditional 
independence of the current time-slice. When the size of the belief state is large, exact 
inference methods such as that of Kjaerulff (1995) are intractable, and it becomes necessary to maintain 
only an approximation of the actual belief state, either in the form of an approximate 
distribution that can be represented compactly (Boyen & Koller, 1998), or in the form of 
a set of weighted samples as in the Sequential Monte-Carlo Sampling methods (Doucet, 
Godsill, & Andrieu, 2000b; Kanazawa, Koller, & Russell, 1995; Liu & Chen, 1998). 

The simplest case of the DBN, where in each time-slice there is only a single state 
variable and an observation node, is the well-known Hidden Markov Model (HMM) (Ra- 
biner, 1989). Filtering in this simple structure can be solved using dynamic program- 
ming in the discrete HMM (Rabiner, 1989), or Kalman filtering in the linear Gaussian 
model (Kalman, 1960). More recently, extensions of the HMM with multiple hidden in- 
teracting chains such as the Coupled Hidden Markov Models (CHMM) and the Factorial 
Hidden Markov Models (FHMM) have been proposed (Brand, 1997; Ghahramani & Jordan, 
1997; Jordan, Ghahramani, & Saul, 1997). In these models, the size of the belief state is 
exponential in the number of hidden chains. Therefore, the inference and parameter estima- 
tion problems become intractable if the number of hidden chains is large. For this reason, 
approximate techniques are required. CHMM (Brand, 1997) employs a deterministic ap- 
proximation that approximates full dynamic programming by keeping only a fixed number 
of "heads" with highest probabilities. The "heads" are thus chosen deterministically rather 
than randomly as in sampling-based methods. FHMM (Ghahramani & Jordan, 1997; 
Jordan et al., 1997) uses variational approximation (Jordan, Ghahramani, Jaakkola, & Saul, 
1999) which approximates the full FHMM structure by a sparsified tractable structure. This 
idea is similar to the structured approximation method in (Boyen & Koller, 1998). 

Our AHMM can be viewed as a type of Coupled/Factorial HMM since the AHMM 
also consists of a number of interacting chains. However, the type of interaction in our 
AHMM is different from the other types of interaction that have been considered (Brand, 
1997; Jordan et al., 1997; Ghahramani & Jordan, 1997). This is because the main focus 
of the AHMM is the dynamics of temporal abstraction among the chains, rather than the 
correlation between them at the same time interval. In addition, each node in the AHMM 
has a specific meaning (policy, state, or policy termination status), and the links have a 
clear causal interpretation based on the policy selection and persistence model. This is in 
contrast to the Coupled/Factorial HMM where the nodes and links usually do not have 
any clear semantic/causal interpretation. The advantage is that prior knowledge about the 
temporal decomposition of an abstract process can be incorporated in the AHMM more 
naturally. 

2.3 Sequential Importance Sampling (SIS) 

Sequential Importance Sampling (SIS) (Doucet et al., 2000b; Liu & Chen, 1998), also 
known as Particle Filter (PF), is a general Monte-Carlo approximation scheme for dynamic 
stochastic models. In principle, the SIS method is the same as the so-called Bayesian 
Importance Sampling (BIS) estimator in the static case (Geweke, 1989). Suppose that we 
want to estimate the quantity $\bar{f} = \int f(x)p(x)\,dx$, i.e., the mean of $f(x)$ where $x$ is a random 
variable with density $p$. Note that if $f$ is taken as the indicator function of an event $A$ 
then $\bar{f}$ is simply $\Pr(A)$. Let $q(x)$ be an arbitrary² density function, termed the importance 
distribution. Usually, the importance distribution $q$ is chosen so that it is easy to obtain 
random samples from it. The expectation under estimation can then be rewritten as: 

$$\bar{f} = \frac{\int [f(x)p(x)/q(x)]\,q(x)\,dx}{\int [p(x)/q(x)]\,q(x)\,dx} = \frac{E_q[f(x)p(x)/q(x)]}{E_q[p(x)/q(x)]}$$

From this expression, the BIS estimator w.r.t. $q$ can be obtained: 

$$\bar{f} \approx \hat{f}_{BIS} = \sum_{i=1}^{N} f(x^{(i)})\,\tilde{w}(x^{(i)})$$

where $\{x^{(i)}\}$ are the $N$ i.i.d. samples taken from $q(x)$, $w(x) = p(x)/q(x)$ and $\tilde{w}$ is the 
normalised weight $\tilde{w}(x^{(i)}) = w(x^{(i)}) / \sum_{j=1}^{N} w(x^{(j)})$. Note that the normalised weight can be 
computed from any weight function $w'(x) \propto w(x)$, i.e., the weight function need only be 
computed up to a normalising constant factor. 

2. For the weights to be properly defined, the support of q has to contain the support of p. 
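
The BIS estimator with normalised weights can be sketched as follows. The target (a standard normal), the importance distribution (a zero-mean normal with standard deviation 2), and all function names are assumptions chosen for illustration; the weight function is deliberately specified only up to a constant factor, as the text permits.

```python
import math
import random


def bis_estimate(f, log_weight, sample_q, n_samples, rng):
    """Self-normalised importance sampling estimate of the mean of f under p."""
    xs = [sample_q(rng) for _ in range(n_samples)]
    ws = [math.exp(log_weight(x)) for x in xs]    # unnormalised weights
    total = sum(ws)                               # normalising constant of the weights
    return sum(f(x) * w for x, w in zip(xs, ws)) / total


# Assumed toy setting: target p = N(0, 1), importance distribution q = N(0, 2^2).
# p(x)/q(x) is proportional to exp(-x^2/2 + x^2/8); constant factors are dropped.
def log_weight(x):
    return -0.5 * x * x + x * x / 8.0


def sample_q(rng):
    return rng.gauss(0.0, 2.0)


rng = random.Random(0)
# The second moment of a standard normal is 1, so the estimate should be close to 1.
second_moment = bis_estimate(lambda x: x * x, log_weight, sample_q, 20000, rng)
```

Choosing q wider than p keeps the weights bounded; a q narrower than p would still satisfy the support condition here but would give weights with much higher variance.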

In the dynamic case, we want to estimate $\bar{f} = \int f(\tilde{x}_t)\,p(\tilde{x}_t \mid \tilde{o}_t)$ where $\tilde{x}_t = (x_0, \ldots, x_t)$ 
and $\tilde{o}_t = (o_0, \ldots, o_t)$ are two sequences of random variables; $\tilde{o}_t$ represents the observations 
available to us at time $t$. Often, $\tilde{x}_t$ is a Markov sequence and $o_t$ is the observation of $x_t$ 
as in a HMM. In a DBN, $x_t$ corresponds to the set of state variables and $o_t$ corresponds to 
the set of observations at time-slice $t$. The SIS method presented here, however, applies to 
the most general case where $\tilde{x}_t$ can be non-Markov, and $o_t$ does not depend only on $x_t$. 

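We can now introduce the importance distribution $q(\tilde{x}_t \mid \tilde{o}_t)$ to obtain the estimator: 

$$\bar{f} \approx \hat{f}_{SIS} = \sum_{i=1}^{N} f(\tilde{x}_t^{(i)})\,\tilde{w}(\tilde{x}_t^{(i)}) \qquad (1)$$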

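To ensure that we can obtain samples from $q(\tilde{x}_t \mid \tilde{o}_t)$ "online", i.e., to sample a new value 
$x_t$ for the sequence $\tilde{x}_t$ when the current observation $o_t$ arrives, $q$ must be restricted to the 
form: 

$$q(\tilde{x}_t \mid \tilde{o}_t) = q(\tilde{x}_{t-1} \mid \tilde{o}_{t-1})\,q(x_t \mid \tilde{x}_{t-1}, \tilde{o}_t)$$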

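With this restriction on $q$, we can use the weight function $w(\tilde{x}_t) = p(\tilde{x}_t, \tilde{o}_t)/q(\tilde{x}_t \mid \tilde{o}_t)$ so that 
the weight can also be updated "online" using: 

$$w(\tilde{x}_t) = w(\tilde{x}_{t-1})\,\frac{p(x_t, o_t \mid \tilde{x}_{t-1}, \tilde{o}_{t-1})}{q(x_t \mid \tilde{x}_{t-1}, \tilde{o}_t)} \qquad (2)$$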

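Let $w_t = w(\tilde{x}_t)/w(\tilde{x}_{t-1})$ be the weight updating factor at time $t$, and $q_t = q(x_t \mid \tilde{x}_{t-1}, \tilde{o}_t)$ 
be the sampling distribution used at time $t$. From (2) we have 

$$w_t q_t = p(x_t, o_t \mid \tilde{x}_{t-1}, \tilde{o}_{t-1}) \qquad (3)$$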

which means that $p(x_t, o_t \mid \tilde{x}_{t-1}, \tilde{o}_{t-1})$ is factorised into two parts: $w_t$ and $q_t$. By 
choosing different factorisations, we obtain different forms for $q_t$ and thus different importance 
distributions $q$. For example, when $(\tilde{x}_t, \tilde{o}_t)$ is a HMM, $q_t$ can be chosen as $p(x_t \mid x_{t-1})$ 
with $w_t = p(o_t \mid x_t)$ as in the likelihood weighting (LW) method, or $q_t$ can be chosen as 
$p(x_t \mid x_{t-1}, o_t)$ with $w_t = p(o_t \mid x_{t-1})$ as in the likelihood weighting with evidence 
reversal (LW-ER) (Kanazawa et al., 1995). In general, the "forward" $q_t$ can be chosen as 
$p(x_t \mid \tilde{x}_{t-1}, \tilde{o}_{t-1})$ with the corresponding weight $w_t = p(o_t \mid \tilde{x}_t, \tilde{o}_{t-1})$. The "optimal" $q_t$, in 
the sense discussed in (Doucet et al., 2000b), is chosen as $q_t = p(x_t \mid \tilde{x}_{t-1}, \tilde{o}_t)$ with the 
associated weight $w_t = p(o_t \mid \tilde{x}_{t-1}, \tilde{o}_{t-1})$. 
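
The two HMM factorisations mentioned above (LW and LW-ER) can be checked numerically against Eq. (3). The sketch below assumes a two-state HMM with made-up parameters; the function names are hypothetical and serve only to make the factorisation explicit.

```python
# Assumed two-state HMM; all numbers are illustrative.
TRANS = [[0.9, 0.1], [0.2, 0.8]]   # TRANS[x_prev][x] = p(x_t | x_{t-1})
EMIT = [[0.8, 0.2], [0.3, 0.7]]    # EMIT[x][o]       = p(o_t | x_t)


def joint_step(x_prev, x, obs):
    """p(x_t, o_t | x_{t-1}): the right-hand side of Eq. (3) for an HMM."""
    return TRANS[x_prev][x] * EMIT[x][obs]


def lw(x_prev, x, obs):
    """Likelihood weighting: q_t = p(x_t | x_{t-1}), w_t = p(o_t | x_t)."""
    return TRANS[x_prev][x], EMIT[x][obs]


def lw_er(x_prev, x, obs):
    """Evidence reversal: q_t = p(x_t | x_{t-1}, o_t), w_t = p(o_t | x_{t-1})."""
    w_t = sum(joint_step(x_prev, x2, obs) for x2 in range(2))
    q_t = joint_step(x_prev, x, obs) / w_t
    return q_t, w_t


def max_factorisation_error():
    """Largest deviation of w_t * q_t from p(x_t, o_t | x_{t-1}) over all cases."""
    error = 0.0
    for x_prev in range(2):
        for x in range(2):
            for obs in range(2):
                for scheme in (lw, lw_er):
                    q_t, w_t = scheme(x_prev, x, obs)
                    error = max(error, abs(w_t * q_t - joint_step(x_prev, x, obs)))
    return error
```

Note that under LW-ER the weight factor does not depend on the newly sampled value, which is the property that makes the corresponding sampling distribution "optimal" in the HMM case.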

The general SIS approximation scheme is thus as follows. At time $t-1$, we maintain $N$ 
sample sequences $\{\tilde{x}_{t-1}^{(i)}\}$ and the $N$ corresponding weight values $\{w^{(i)}\}$. When the current 
observation $o_t$ arrives, each sequence $\tilde{x}_{t-1}^{(i)}$ is lengthened by a new value $x_t^{(i)}$ sampled from 
the distribution $q(x_t \mid \tilde{x}_{t-1}^{(i)}, \tilde{o}_t)$. The weight value for $\tilde{x}_t^{(i)}$ is then updated using (2). Once 
the new samples and the new weights are obtained, the expectation of any functional $f$ can 
be estimated using (1). This procedure can be further enhanced with a re-sampling step 
and a Markov-chain sampling step (see Doucet et al. (2000b), Doucet, de Freitas, Murphy, 
and Russell (2000a)). We do not describe these important improvements of the SIS here.³ 
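
A minimal sketch of this scheme, using the likelihood weighting choice of $q_t$ and $w_t$ on an assumed two-state HMM, is given below together with an exact forward filter for comparison. The model parameters and function names are illustrative assumptions, and the re-sampling step is deliberately left out.

```python
import random

# Assumed two-state HMM; all numbers are illustrative.
PRIOR = [0.5, 0.5]                 # p(x_0)
TRANS = [[0.9, 0.1], [0.2, 0.8]]   # p(x_t | x_{t-1})
EMIT = [[0.8, 0.2], [0.3, 0.7]]    # p(o_t | x_t)


def sample_discrete(probs, rng):
    """Draw an index according to the given discrete distribution."""
    u, acc = rng.random(), 0.0
    for value, p in enumerate(probs):
        acc += p
        if u < acc:
            return value
    return len(probs) - 1


def sis_filter(observations, n_samples, rng):
    """SIS estimate of Pr(x_t = 0 | o_0..o_t) with likelihood weighting."""
    # Only the last state of each sequence is kept: for an HMM, both the
    # sampling distribution and f depend on the sequence through it alone.
    states = [0] * n_samples
    weights = [1.0] * n_samples
    for t, obs in enumerate(observations):
        for i in range(n_samples):
            q_t = PRIOR if t == 0 else TRANS[states[i]]
            states[i] = sample_discrete(q_t, rng)      # lengthen the sequence
            weights[i] *= EMIT[states[i]][obs]         # weight update, Eq. (2)
    total = sum(weights)
    return sum(w for s, w in zip(states, weights) if s == 0) / total   # Eq. (1)


def exact_filter(observations):
    """Exact forward filtering, used only to check the sampling estimate."""
    belief = PRIOR
    for t, obs in enumerate(observations):
        predicted = belief if t == 0 else [
            sum(belief[i] * TRANS[i][j] for i in range(2)) for j in range(2)]
        unnormalised = [predicted[j] * EMIT[j][obs] for j in range(2)]
        norm = sum(unnormalised)
        belief = [u / norm for u in unnormalised]
    return belief[0]


OBSERVATIONS = [0, 0, 1, 1, 0]
estimate = sis_filter(OBSERVATIONS, 20000, random.Random(0))
exact = exact_filter(OBSERVATIONS)
```

Without re-sampling, the weights degenerate as the observation sequence grows, so this sketch is only suitable for short sequences.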

2.4 Rao-Blackwellisation 

Rao-Blackwellisation is a general technique for improving the accuracy of sampling methods 
by analytically marginalising some variables and only sampling the remainder (Casella & 
Robert, 1996). In its simplest form, consider the problem of estimating the expectation 
$E f(x)$, where $x$ is a joint product of two variables $r, z$. Using direct Monte-Carlo 
sampling, we obtain the estimator: $\hat{f} = \frac{1}{N}\sum_{i=1}^{N} f(r^{(i)}, z^{(i)})$. Alternatively, a Rao-Blackwellised 
estimator can be derived by sampling only the variable $r$, with the other variable $z$ being 
integrated out analytically: 

$$E f(r,z) = E h(r) \approx \hat{f}_{RB} = \frac{1}{N}\sum_{i=1}^{N} h(r^{(i)})$$

where $h(r) = E_z[f(r,z) \mid r]$. For our convenience, $r$ will be referred to as the Rao-Blackwellising 
variable. 

The Rao-Blackwellised estimator $\hat{f}_{RB}$ is generally more accurate than $\hat{f}$ for the same 
number of samples $N$. This is a direct consequence of the Rao-Blackwell theorem which 
gives the relationship between unconditional and conditional variance: 

$$\mathrm{VAR}[X] = \mathrm{VAR}[E[X \mid Y]] + E[\mathrm{VAR}[X \mid Y]]$$

Applying this to the problem of estimating $E f(r,z)$, we have: 

$$\mathrm{VAR}[f(r,z)] = \mathrm{VAR}[E[f(r,z) \mid r]] + E[\mathrm{VAR}[f(r,z) \mid r]]$$

and thus $\mathrm{VAR}[f(r,z)] \ge \mathrm{VAR}[E[f(r,z) \mid r]] = \mathrm{VAR}[h(r)]$. This suggests that for direct 
Monte-Carlo sampling, the error of RB-sampling (sample only $r$ and marginalise $z$) is never 
larger than the error of sampling both $r$ and $z$ for the same number of samples, and is strictly 
smaller except in the degenerate case. For Bayesian Importance Sampling, using the variance convergence 
result from (Geweke, 1989), one can also easily prove that as the number of samples tends to 
infinity, the RB-BIS would generally do better than BIS for the same number of samples. 
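
The variance reduction can be observed empirically on a toy problem. In the sketch below, $r$ is a fair coin, $z$ given $r$ is normal with mean `MU[r]` and unit variance, and $f(r,z) = z$, so that $h(r)$ is simply `MU[r]`. The model and all names are assumptions made for illustration.

```python
import random
import statistics

MU = [0.0, 1.0]   # assumed E[z | r]; z | r ~ Normal(MU[r], 1), r ~ Bernoulli(0.5)


def direct_estimate(n_samples, rng):
    """Direct Monte-Carlo: sample both r and z, then average f(r, z) = z."""
    total = 0.0
    for _ in range(n_samples):
        r = rng.randrange(2)
        total += rng.gauss(MU[r], 1.0)
    return total / n_samples


def rb_estimate(n_samples, rng):
    """Rao-Blackwellised: sample only r and average h(r) = E[z | r] = MU[r]."""
    return sum(MU[rng.randrange(2)] for _ in range(n_samples)) / n_samples


def estimator_variance(estimator, n_samples, n_repeats, rng):
    """Empirical variance of an estimator over independent repetitions."""
    estimates = [estimator(n_samples, rng) for _ in range(n_repeats)]
    return statistics.variance(estimates)


rng = random.Random(0)
var_direct = estimator_variance(direct_estimate, 50, 400, rng)
var_rb = estimator_variance(rb_estimate, 50, 400, rng)
```

For this model the per-sample variance of $f(r,z)$ is 1.25 while that of $h(r)$ is 0.25, so `var_rb` should come out well below `var_direct`; the gap is exactly the $E[\mathrm{VAR}[f(r,z) \mid r]]$ term removed by marginalising $z$.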

3. Note that these improvements can be used independently of the Rao-Blackwellisation procedure discussed 
subsequently. Our implementation of the policy recognition algorithm in the later sections does include 
a re-sampling step, which is crucial for keeping the error of SIS over time under control. 




2.5 SIS with Rao-Blackwellisation (RB-SIS) 

Since SIS is a form of BIS, Rao-Blackwellisation can also be used to improve its 
performance (Liu & Chen, 1998; Doucet et al., 2000b). Let us consider again the problem of 
estimating the expectation $\bar{f} = \int f(\tilde{x}_t)\,p(\tilde{x}_t \mid \tilde{o}_t)$, where each variable $x_t$ is the joint product 
of two variables $(z_t, r_t)$. We shall restrict ourselves to the case where $\tilde{x}_t$ is Markov and 
$o_t$ is an observation of $x_t$, i.e., when $(\tilde{x}_t, \tilde{o}_t)$ can be represented by a DBN. In addition, 
we only consider $f$ that depends only on the current variable $x_t$, i.e., $\bar{f}$ is an expectation 
over the filtering distribution $p(x_t \mid \tilde{o}_t)$. For example, if $A$ is a "future" event, i.e., an event 
that depends on $\{x_{t'} \mid t' > t\}$, we can estimate $p(A \mid \tilde{o}_t)$ by letting $f(x_t) = p(A \mid x_t)$ so that 
$\bar{f} = p(A \mid \tilde{o}_t)$. 

Applying Rao-Blackwellisation to this setting, we can let $h(\tilde{r}_t) = \int_{z_t} f(z_t, r_t)\,p(z_t \mid \tilde{r}_t, \tilde{o}_t)$, 
so that $\bar{f} = \bar{h} = \int h(\tilde{r}_t)\,p(\tilde{r}_t \mid \tilde{o}_t)$. Thus, if we use SIS to estimate $\bar{h}$, we also obtain an 
estimator for $\bar{f}$: 

$$\bar{f} \approx \hat{f}_{RBSIS} = \hat{h}_{SIS} = \sum_{i=1}^{N} h(\tilde{r}_t^{(i)})\,\tilde{w}(\tilde{r}_t^{(i)}) \qquad (4)$$

The benefit of doing this is the increase in the accuracy of the estimator, as we now 
only need to sample the variables $\tilde{r}_t$. The down side is that for each sample $\tilde{r}_t^{(i)}$, we need 
to compute $h(\tilde{r}_t^{(i)})$ using some exact inference method. Furthermore, the SIS procedure to 
estimate $\bar{h}$ might require some additional complexity since the sequence $\tilde{r}_t$ is generally 
non-Markov, and $o_t$ no longer depends only on $r_t$. Overall, in comparison with the normal SIS 
estimator $\hat{f}_{SIS}$ (Eq. 1), for the same number of samples $N$, $\hat{f}_{RBSIS}$ is more accurate but is 
also more computationally demanding. 

To see more clearly what is involved in implementing the RB-SIS method, let us look 
at the Rao-Blackwellised belief state, i.e., the belief state of the dynamic process when the 
Rao-Blackwellising variables can be observed: $\mathcal{R}_t = p(z_t, r_t, o_t \mid \tilde{r}_{t-1}, \tilde{o}_{t-1})$ and its posterior 
$\mathcal{R}_{t+} = p(z_t \mid \tilde{r}_t, \tilde{o}_t)$. All the entities needed in the RB-SIS procedure can be computed from 
these two distributions. Indeed, the functional $h$ can be rewritten in terms of $\mathcal{R}_{t+}$ as: 

$$h(\tilde{r}_t) = \int_{z_t} f(z_t, r_t)\,p(z_t \mid \tilde{r}_t, \tilde{o}_t) = \int_{z_t} f(z_t, r_t)\,\mathcal{R}_{t+}(z_t) \qquad (5)$$

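In addition, while performing SIS to estimate $\bar{h}$, from Eq. (3), the weight $w_t$ and the 
sampling distribution $q_t$ can be computed from $\mathcal{R}_t$: 

$$w_t q_t = p(r_t, o_t \mid \tilde{r}_{t-1}, \tilde{o}_{t-1}) = \mathcal{R}_t(r_t, o_t) = \int_{z_t} \mathcal{R}_t(z_t, r_t, o_t) \qquad (6)$$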

Thus, computing the RB belief state $\mathcal{R}_t$ and its posterior $\mathcal{R}_{t+}$ is an essential step in 
the RB-SIS method. Since we have to maintain an RB belief state for each sample of 
the RB variables $\tilde{r}_t$, it is crucial that this can be done efficiently using an exact inference 
method. If $x_t$ is composed of many variables, as in the case of a DBN, our choice of the 
Rao-Blackwellising variables should be such that the Rao-Blackwellised belief state can be 
maintained in a tractable way. Hence, Rao-Blackwellisation is especially useful when the 
set of variables in a DBN can be split into two parts such that conditioning on the first part 
makes the structure of the second part tractable and amenable to exact inference. 

Begin 
  For t = 0, 1, ... 
    For each sample i = 1, ..., N 
      Sample $r_t^{(i)}$ from $\mathcal{R}_t^{(i)}(r_t \mid o_t)$ 
      Update weight $w^{(i)} = w^{(i)}\,\mathcal{R}_t^{(i)}(o_t)$ 
      Compute the posterior RB belief state $\mathcal{R}_{t+}^{(i)} = \mathcal{R}_t^{(i)}(z_t \mid r_t^{(i)}, o_t)$ 
      Compute the new RB belief state $\mathcal{R}_{t+1}^{(i)}$ from $\mathcal{R}_{t+}^{(i)}$ 
      Compute $h^{(i)}$ from $\mathcal{R}_{t+}^{(i)}$ 
    Compute the estimator $\hat{f}_{RBSIS} = \sum_{i=1}^{N} h^{(i)}\,\tilde{w}^{(i)}$ 
End 



Figure 1: RB-SIS for general DBN 



The general RB-SIS algorithm is given in Fig. 1. For illustration, we assume 
that the "optimal" $q_t$ and the corresponding $w_t$ are being used ($q_t = \mathcal{R}_t(r_t \mid o_t)$ and 
$w_t = \mathcal{R}_t(o_t)$). At each time point, we need to maintain $N$ samples $\tilde{r}_t^{(i)}$, $i = 1, \ldots, N$. For each 
sample, in addition to the sample weight $w^{(i)}$, we also need to store a representation of the 
RB belief state corresponding to that sample sequence: $\mathcal{R}_t^{(i)} = p(r_t, z_t, o_t \mid \tilde{r}_{t-1}^{(i)}, \tilde{o}_{t-1})$ and 
$\mathcal{R}_{t+}^{(i)} = p(z_t \mid \tilde{r}_t^{(i)}, \tilde{o}_t)$. 
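
To make the procedure concrete, here is a hedged sketch of RB-SIS on a small hypothetical DBN, not on the AHMM itself. The Rao-Blackwellising variable r_t is a two-state Markov "mode", z_t is a two-state hidden variable whose transition depends on r_t, and o_t is a noisy reading of z_t. All parameters and names are assumptions. Unlike the pseudocode, which computes the next RB belief state ahead of time, the sketch builds the table for R_t(z_t, r_t, o_t) lazily once o_t is known.

```python
import random

# Hypothetical toy DBN; every number here is an illustrative assumption.
PRIOR_R = [0.5, 0.5]                     # mode before the first transition
PRIOR_Z = [0.5, 0.5]                     # hidden variable before the first transition
TRANS_R = [[0.9, 0.1], [0.1, 0.9]]       # TRANS_R[r_prev][r]
TRANS_Z = [[[0.9, 0.1], [0.1, 0.9]],     # TRANS_Z[r][z_prev][z]; mode 0 is sticky
           [[0.2, 0.8], [0.8, 0.2]]]     # mode 1 tends to flip z
EMIT = [[0.8, 0.2], [0.2, 0.8]]          # EMIT[z][o]


def sample_discrete(probs, rng):
    u, acc = rng.random(), 0.0
    for value, p in enumerate(probs):
        acc += p
        if u < acc:
            return value
    return len(probs) - 1


def rb_table(r_prev, z_post, obs):
    """RB belief state R_t(z_t, r_t, o_t = obs) of one sample, as table[r][z]."""
    return [[TRANS_R[r_prev][r]
             * sum(z_post[zp] * TRANS_Z[r][zp][z] for zp in range(2))
             * EMIT[z][obs]
             for z in range(2)] for r in range(2)]


def rb_sis(observations, n_samples, rng):
    """Estimate Pr(z_t = 0 | observations): sample r, marginalise z exactly."""
    r = [sample_discrete(PRIOR_R, rng) for _ in range(n_samples)]
    z_post = [PRIOR_Z[:] for _ in range(n_samples)]
    w = [1.0] * n_samples
    for obs in observations:
        for i in range(n_samples):
            table = rb_table(r[i], z_post[i], obs)
            mass_r = [sum(row) for row in table]        # R_t(r_t, o_t), Eq. (6)
            evidence = sum(mass_r)                      # R_t(o_t)
            r[i] = sample_discrete([m / evidence for m in mass_r], rng)  # "optimal" q_t
            w[i] *= evidence                            # "optimal" w_t
            z_post[i] = [v / mass_r[r[i]] for v in table[r[i]]]   # posterior R_{t+}
    total = sum(w)
    # Here f is the indicator of z_t = 0, so h is R_{t+}(z_t = 0), as in Eq. (5).
    return sum(wi * zp[0] for wi, zp in zip(w, z_post)) / total


def exact_filter(observations):
    """Exact filtering over the joint state (r, z), used only for comparison."""
    belief = {(r, z): PRIOR_R[r] * PRIOR_Z[z] for r in range(2) for z in range(2)}
    for obs in observations:
        new = {(r, z): EMIT[z][obs] * sum(b * TRANS_R[rp][r] * TRANS_Z[r][zp][z]
                                          for (rp, zp), b in belief.items())
               for r in range(2) for z in range(2)}
        norm = sum(new.values())
        belief = {k: v / norm for k, v in new.items()}
    return belief[(0, 0)] + belief[(1, 0)]


OBSERVATIONS = [0, 0, 1, 1, 0, 1]
rb_estimate = rb_sis(OBSERVATIONS, 5000, random.Random(0))
exact = exact_filter(OBSERVATIONS)
```

In this toy model the sequence of r values is Markov, which is the easy case; the sketch also omits re-sampling. It is meant to show how the sampling distribution, the weight, and h are all read off the per-sample RB belief state.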

A number of applications of the RB-SIS method (also known as the Rao-Blackwellised 
Particle Filter (RBPF)) have been discussed in the literature. A general framework for 
using RB-SIS to do inference on DBNs has been presented by Doucet et al. (2000a), Murphy 
(2000), Murphy and Russell (2001). However, these authors have mainly focused on the 
case where the sequence of the Rao-Blackwellising variables $\{\tilde{r}_t\}$ is Markov (for example, 
when the RB variables are the root nodes at each time slice). This assumption simplifies 
the sampling step in the RB procedure since obtaining the sample for the RB variable 
at time t + 1 is straightforward. In our previous work (Bui, Venkatesh, & West, 2000), 
we introduced a hybrid inference method for the AHMM in the special case of the 
state-space decomposition policy hierarchy, which is essentially an RB-SIS method. Note that 
when applied to AHMMs, the sequence of Rao-Blackwellising variables that we use does not 
satisfy the Markov property. In this case, care must be taken to design an efficient sampling 
step, especially when the sampling distribution for the next RB variable does not have a 
tractable form. The use of non-Markov RB variables also appears in other special models 
such as the Bayesian missing data model (Liu & Chen, 1998), and the partially observed 
Gaussian state space model (Andrieu & Doucet, 2000) where the RB belief state can be 
maintained by a Kalman filter. 

Since we have to make the Rao-Blackwellised belief state tractable, the context 
variables in the framework of context-specific independence (Boutilier, Friedman, Goldszmidt, 
& Koller, 1996) can be used conveniently as Rao-Blackwellising variables (Murphy, 2000). 
Indeed, since the context variable acts as a mixing gate for different Bayesian network 
structures, conditioning on these variables would simplify the structure of the remaining 
variables. Because of this property of the context variables, Boutilier et al. (1996) have suggested 
using them as the cut-set variables in the cut-set conditioning inference method (Pearl, 
1988). The cut-set variables play a similar role to the Rao-Blackwellising variables in that 
they help to simplify the structure of the remaining network. In Rao-Blackwellised 
sampling, instead of summing over all the possible values of the cut-set variables, which can be 
intractable, only a number of representative sampled values are used. 

The idea of combining both exact and approximate inference in RB sampling is also 
similar to the hybrid inference scheme described by Dawid, Kjaerulff, and Lauritzen (1995). 
However, it is unclear whether RB sampling can be described using their model of 
communicating belief universes. Also, Dawid et al. use hybrid inference mainly to do 
inference on networks with a mixture of continuous and discrete variables, whereas the 
goal of RB is to improve the sampling performance. 

3. Abstract Markov Policies 

In this section, we formally introduce the AMP concept as originating from the literature 
on abstract probabilistic planning with MDPs (Sutton et al., 1999; Parr & Russell, 1997; 
Forestier & Varaiya, 1978; Hauskrecht et al., 1998; Dean & Lin, 1995). The main motivation 
in abstract probabilistic planning is to scale up MDP-based planning to problems with large 
state spaces. It has been noted that a hierarchical organisation of policies can help reduce 
the complexity of MDP-based planning, similar to the role played by the plan hierarchy 
in classical planning (Sacerdoti, 1974). In comparison with a classical plan hierarchy, a 
policy hierarchy can model different sources of uncertainty in the planning process such as 
stochastic actions, uncertain action outcomes, and stochastic environment dynamics. 

While the work in planning is concerned with finding the optimal policy given some 
reward function, our work focuses on policy recognition which is the inverse problem, i.e., to 
infer the agent's policies from watching the effects of the agent's actions. The two problems 
however share a common element which is the model of a stochastic plan hierarchy. In policy 
recognition, although it is possible to derive some information about the reward function 
by observing the agent's behaviour, we choose not to do this, thus omitting from our model 
the reward function and also the optimality notion. This leaves the model open to tracking 
arbitrary agent behaviours, regardless of whether they are optimal or not. 

3.1 The General Model 
3.1.1 Actions and Policies 

In an MDP, the world is modelled as a set of possible states $S$, termed the state space. At 
each state $s$, an agent has a set of actions $A$ available, where each action $a$, if employed, will 
cause the world to evolve to the next state $s'$ via a transition probability $\sigma_a(s, s')$. An agent's 
plan of actions is modelled as a policy that prescribes how the agent would choose its action 
at each state. For a policy $\pi$, this is modelled by a selection function $\sigma_\pi : S \times A \to [0, 1]$ 
where at each state $s$, $\sigma_\pi(s, a)$ is the probability that the agent will choose the action $a$. It 
is easy to see that, given a fixed policy $\pi$, the resulting state sequence is a Markov chain 
with transition probabilities $\Pr(s' \mid s) = \sum_a \sigma_\pi(s, a)\,\sigma_a(s, s')$. Thus, a policy can also be 
viewed as a Markov chain through the state space. 
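
The chain induced by a fixed policy can be computed directly from the two probability tables. The following is a minimal sketch, not part of the original formulation: the two-state, two-action MDP, its numbers, and the names `sigma_a`, `sigma_pi` and `induced_chain` are all illustrative assumptions.

```python
# Hypothetical two-state, two-action MDP; all numbers are illustrative only.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# sigma_a[a][s][s2]: probability that action a takes state s to state s2.
sigma_a = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
    "move": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}

# sigma_pi[s][a]: probability that the policy selects action a in state s.
sigma_pi = {
    "s0": {"stay": 0.25, "move": 0.75},
    "s1": {"stay": 1.0, "move": 0.0},
}


def induced_chain(sigma_pi, sigma_a, states, actions):
    """Return P[s][s2] = sum over a of sigma_pi(s, a) * sigma_a(s, s2)."""
    return {
        s: {
            s2: sum(sigma_pi[s][a] * sigma_a[a][s][s2] for a in actions)
            for s2 in states
        }
        for s in states
    }


P = induced_chain(sigma_pi, sigma_a, STATES, ACTIONS)
```

Each row of `P` sums to one, so `P` is the transition matrix of the Markov chain that the policy traces through the state space.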

3.1.2 Local Policies 

In the original MDP, behaviours are modelled at only two levels: the primitive action 
level, and the plan level (policy). We would like to consider policies that select other 
more refined policies and so on, down a number of abstraction levels. The idea is to form 
intermediate-level abstract policies as policies that are defined over a local region of the 
state space, have a certain terminating condition, and can be invoked and executed just like 
primitive actions (Forestier & Varaiya, 1978; Sutton et al., 1999). 

Definition 1 (Local policy). A local policy is a tuple $\pi = (S, D, \beta, \sigma)$ where: 

• $S$ is the set of applicable states. 

• $D$ is the set of destination states, $\beta : D \to (0, 1]$ is the set of stopping probabilities such 
that $\beta(d) = 1, \forall d \in D \setminus S$. 

• $\sigma : S \times A \to [0, 1]$ is the selection function. Given the current state $s$, $\sigma(s, a)$ is the 
probability that the action $a$ is selected by the policy $\pi$ at state $s$. 

The set $S$ models the local region over which the policy is applicable. $S$ will be called the 
set of applicable states, since the policy can start from any state in $S$. We shall assume here 
that $S$ is discrete, and thus shall not be concerned with the technical details in generalising 
the AHMM formulation to the continuous state space case. The stopping condition of the 
policy is modelled by a set of possible destination states $D$ and a set of positive stopping 
probabilities $\beta(d)$, $d \in D$, where $\beta(d)$ is the probability that the policy will terminate when 
the current state is $d$. It is possible to allow the policy to stop at some state outside of 
$S$; however, for all $d \in D \setminus S$ we enforce the condition that $\beta(d) = 1$, i.e., $d$ is a terminal 
destination state. Sometimes, we might only want to consider policies with a deterministic 
stopping condition. In that case, every destination is a terminal destination: $\forall d \in D$, 
$\beta(d) = 1$. Thus, for a deterministically terminating policy, we can ignore the redundant 
parameter $\beta$, and need only specify the set of destinations $D$. 

Given a starting state $s \in S$, a local policy as defined above generates a Markov 
sequence of states according to its transition model. Each time a destination state $d \in D$ is 
reached, the process stops with probability $\beta(d)$. Since the process starts from within $S$, 
but terminates only in one of the states in $D$, the destination states play the role of the 
possible exits out of the local region $S$ of the state space. 

When we want to make clear which policy is currently being referred to, we shall use 
the subscripted notations $S_\pi$, $D_\pi$, $\beta_\pi$, $\sigma_\pi$ to denote the elements of the policy $\pi$. 

Fig. 2 illustrates how a local policy $\pi$ can be visualised. Fig. 2(a) shows the set of 
applicable states $S$, the set of destinations $D$, and a chain starting within $S$ and terminating 
in $D$. The Bayesian network in Fig. 2(b) provides the detailed view of the chain from start 
to finish. The Bayesian network in Fig. 2(c) is the abstract view of the chain where we are 
only interested in its starting and stopping states. 

3.1.3 Abstract Policies 

The local policy as defined above selects among the set of primitive actions. Similarly, but 
more generally, we can define higher level policies that select among a set of other policies. 



Definition 2 (Abstract Policy). Let $\Pi$ be a set of abstract policies. An abstract 
policy $\pi^*$ over the policies in $\Pi$ is a tuple $(S_{\pi^*}, D_{\pi^*}, \beta_{\pi^*}, \sigma_{\pi^*})$ where: 

• $S_{\pi^*} \subseteq \bigcup_{\pi \in \Pi} S_\pi$ is the set of applicable states. 

• $D_{\pi^*} \subseteq \bigcup_{\pi \in \Pi} D_\pi$ is the set of destination states, $\beta_{\pi^*} : D_{\pi^*} \to (0, 1]$ is the set of stopping 
probabilities. 

• $\sigma_{\pi^*} : S_{\pi^*} \times \Pi \to [0, 1]$ is the selection function where $\sigma_{\pi^*}(s, \pi)$ is the probability that 
$\pi^*$ selects the policy $\pi$ at the state $s$. 

Note the recursiveness in Definition 2 that allows an abstract policy to select among a set 
of other abstract policies. At the base level, primitive actions are viewed as abstract policies 
themselves. Since primitive actions always stop after one time-step, $D_a \supset S_a$ and 
$\beta(d) = 1, \forall d \in D_a$ (Sutton et al., 1999). The idea that policies with suitable stopping conditions 
can be viewed just as primitive actions was first made explicit in (Sutton, 1995), which 
also introduces the $\beta$ model for representing the stopping probabilities. Their subsequent 
work (Sutton et al., 1999) introduces the abstract policy concept under the name options. 

The execution of an abstract policy $\pi^*$ is as follows. Starting from some state $s$, $\pi^*$ 
selects a policy $\pi \in \Pi$ according to the distribution $\sigma_{\pi^*}(s, \cdot)$. The selected policy $\pi$ is then 
executed until it terminates in some state $d \in D_\pi$. If $d$ is also a destination state of $\pi^*$ 
($d \in D_{\pi^*}$), the policy $\pi^*$ stops with probability $\beta_{\pi^*}(d)$. If $\pi^*$ still continues, a new policy 
$\pi' \in \Pi$ is selected by $\pi^*$ at $d$, which will be executed until its termination and so on (Fig. 3). 
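
The execution procedure just described can be sketched as a short recursive routine. This is a minimal illustration under stated assumptions, not the authors' implementation: the five-cell corridor, the policy names `to_2`, `to_4` and `exit`, and the helper names `sample` and `execute` are hypothetical, and the stopping test is applied only after a child policy has returned.

```python
import random


def sample(dist, rng):
    """Draw a key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]


def execute(name, s, policies, transition, rng, trace):
    """Run policy `name` from state s and return the state where it stops.

    policies[name] = (D, beta, sigma) for an abstract policy. Any name that
    is missing from `policies` is treated as a primitive action, which takes
    one step and then stops.
    """
    if name not in policies:
        s_next = sample(transition[name][s], rng)
        trace.append(s_next)
        return s_next
    D, beta, sigma = policies[name]
    while True:
        child = sample(sigma[s], rng)  # select a child via sigma(s, .)
        s = execute(child, s, policies, transition, rng, trace)
        # The parent may stop only at the point where its child has stopped.
        if s in D and rng.random() < beta[s]:
            return s


# Hypothetical corridor with cells 0..4; every number below is illustrative.
transition = {
    "right": {s: {min(s + 1, 4): 1.0} for s in range(4)},
    "left": {s: {max(s - 1, 0): 1.0} for s in range(4)},
}
noisy_right = {"right": 0.8, "left": 0.2}
policies = {
    # Level 1: applicable on {0, 1}, exits at 2.
    "to_2": ({2}, {2: 1.0}, {0: noisy_right, 1: noisy_right}),
    # Level 1: applicable on {2, 3}, exits at 4 (success) or 1 (failure).
    "to_4": ({1, 4}, {1: 1.0, 4: 1.0}, {2: noisy_right, 3: noisy_right}),
    # Level 2: chains the level-1 policies until cell 4 is reached.
    "exit": ({4}, {4: 1.0}, {0: {"to_2": 1.0}, 1: {"to_2": 1.0},
                             2: {"to_4": 1.0}, 3: {"to_4": 1.0}}),
}

trace = []
final = execute("exit", 0, policies, transition, random.Random(1), trace)
```

When `to_4` slips back to cell 1 it terminates in a failure destination, and `exit` simply selects `to_2` again, which is the re-selection behaviour described in the text.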

Some remarks about the representation of an abstract policy are needed here. Let 
$s \in \bigcup_{\pi \in \Pi} S_\pi$; we denote the subset of policies in $\Pi$ which are applicable at $s$ by 
$\Pi(s) = \{\pi \in \Pi \mid s \in S_\pi\}$. For an abstract policy $\pi^*$ to be well-defined, we have to make sure that 
at each state $s$, $\pi^*$ only selects among the policies that are applicable at $s$. Thus, the 
selection function has to be such that $\sigma_{\pi^*}(s, \pi) > 0$ only if $\pi \in \Pi(s)$. This helps to keep 
the specification of the selection function to a manageable size, even when the set of all 
policies $\Pi$ to be chosen from can be large. In addition, the specification of the selection 
function and the stopping probabilities can make use of factored representations (Boutilier, 
Dearden, & Goldszmidt, 2000) in the case where the state space is the composite of a set 
of relatively independent variables. This ensures that we still have a compact specification 
of the probabilities conditioned on the state variable, even though the state space can be 
of high dimension. 

3.1.4 Policy Hierarchy 

Using abstract policies as the building blocks, we can construct a hierarchy of abstract 
policies as follows: 

Definition 3 (Policy hierarchy). A policy hierarchy is a sequence $\mathcal{H} = (\Pi_0, \Pi_1, \ldots, \Pi_K)$ 
where $K$ is the number of levels in the hierarchy, $\Pi_0$ is a set of primitive actions, and for 
$k = 1, \ldots, K$, $\Pi_k$ is a set of abstract policies over the policies in $\Pi_{k-1}$. 

When a top-level policy $\pi^K$ is executed, it invokes a sequence of level-(K-1) policies, each 
of which invokes a sequence of level-(K-2) policies and so on. A level-1 policy will invoke 
a sequence of primitive actions which leads to a sequence of states. Thus, the execution 
of $\pi^K$ generates an overall state sequence $(s_0, s_1, \ldots, s_t, \ldots)$ that terminates in one of the 
destination states in $D_{\pi^K}$. When $K = 1$ this sequence is simply a Markov chain (with 
suitable stopping conditions). However, for $K \geq 2$, it will generally be non-Markovian, 
despite the fact that all the policies are Markov, i.e., they select the lower level policies 
based solely on the current state (Sutton et al., 1999). This is because knowing the current 
state $s_t$ alone does not provide information about the current intermediate-level policies, 
which can affect the selection of the next state $s_{t+1}$. Intuitively, this means that an agent's 
behaviour to achieve a given goal is usually non-Markovian, since its choice of actions 
depends not only on the current state, but also on the current intermediate intentions of 
the agent. 

We term the dynamical process in executing a top-level abstract policy $\pi^K$ the Abstract 
Markov Model (AMM). When the states are only partially observable, the observation can 
be modelled by the usual observation model $\Pr(o_t \mid s_t) = \omega(s_t, o_t)$. The resulting process is 
termed the Abstract Hidden Markov Model (AHMM) since the states are hidden as in the 
Hidden Markov Model (Rabiner, 1989). 

The idea of having a higher level policy controlling the lower level ones in an MDP 
can be traced back to the work by Forestier and Varaiya (1978), who investigated a two 
layer structure similar to our 2-level policy hierarchy with deterministic stopping condition. 
Forestier and Varaiya showed that the sub-process, obtained by sub-sampling the state 
sequence at the times when the level-1 policy terminates, is also Markov, and thus the policies 
at level 1 simply play the role of "extended" actions. In our framework, given a policy 
hierarchy, one can consider a "lifted" model where only the policies from level $k$ up and the 
observations at the time points when a policy at level $k$ ends are considered. The level-$k$ 
policies can then be considered as primitive actions, and the lifted model can be treated 
like a normal model. 

Figure 4: The environment and its partition 



3.2 State-Space Region-Based Decomposition 

In some cases, the state space or some of its dimensions already exhibit a natural hierarchical 
structure. For example, in the spatial domain, the set of ground positions can be divided 
into small local spaces such as rooms, corridors, etc. A set of these local spaces can be 
grouped together to form a larger space at the higher level (floors, buildings, etc). An 
intuitive and often-used method for constructing the policy hierarchy in this case is via 
the so-called region-based decomposition of the state space (Dean & Lin, 1995; Hauskrecht 
et al., 1998). Here, the state space $S$ is successively partitioned into a sequence of partitions 
$\mathcal{P}_K, \mathcal{P}_{K-1}, \ldots, \mathcal{P}_1$ corresponding to the $K$ levels of abstraction, where $\mathcal{P}_K = \{S\}$ is the coarsest 
partition, and $\mathcal{P}_1$ is the finest. For each region $R_i$ of $\mathcal{P}_i$, the periphery of $R_i$, $Per(R_i)$, is 
defined as the set of states not in $R_i$ but connected to some state in $R_i$. Let $Per_i$ be the 
set of all peripheral states at level $i$: $Per_i = \bigcup_{R_i \in \mathcal{P}_i} Per(R_i)$. Fig. 4(b) shows an example 
where the state space representing a building is partitioned into 4 regions corresponding to 
the 4 rooms. The peripheral states for a region are shown in Fig. 4(a), and Fig. 4(b) shows all 
such peripheral states. 
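
The periphery of a region is straightforward to compute once "connected" is given a concrete meaning. The sketch below assumes, purely for illustration, that two cells are connected when they are 4-adjacent on a grid; the 2 x 4 grid, the two rooms, and the names `periphery` and `neighbours` are hypothetical and do not correspond to the building of Fig. 4.

```python
def periphery(region, neighbours):
    """States outside `region` that are adjacent to some state inside it."""
    return {n for s in region for n in neighbours(s)} - set(region)


# Hypothetical 2 x 4 grid split into a left room and a right room.
WIDTH, HEIGHT = 4, 2


def neighbours(cell):
    """4-connected neighbours of a cell that lie inside the grid."""
    x, y = cell
    steps = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(i, j) for i, j in steps if 0 <= i < WIDTH and 0 <= j < HEIGHT]


left_room = {(x, y) for x in range(2) for y in range(HEIGHT)}
right_room = {(x, y) for x in range(2, WIDTH) for y in range(HEIGHT)}
partition = [left_room, right_room]

# Per(R) for one region, and the union over all regions of the partition.
per_left = periphery(left_room, neighbours)
per_level = set().union(*(periphery(r, neighbours) for r in partition))
```

By construction a region and its periphery are disjoint, which is the property that gives the policies of this decomposition their deterministic stopping conditions.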

To construct the policy hierarchy, we first define for each region $R_1 \in \mathcal{P}_1$ a set of abstract 
policies applicable on $R_1$ and having $Per(R_1)$ as the destination states. For example, for 
each room in Fig. 4, we can define a set of policies that model the agent's different behaviours 
while it is inside the room, e.g., getting out through a particular door. These policies can 
be initiated from inside the room, and terminate when the agent steps out of the room 
(not necessarily through the target door since the policy might fail to achieve its intended 
target). Note that since $Per(R_1) \cap R_1 = \emptyset$, all the policies defined in this manner have 
deterministic stopping conditions. 

Let the set of all policies defined be $\Pi_1$. At the higher level $\mathcal{P}_2$, for each region $R_2$, 
we can define a set of policies that model the agent's behaviours inside that region with 
applicable state space $R_2$, destination set $Per(R_2)$, and the constraint that these policies 
must use the policies previously defined at level 1 to achieve their goals. An example is a 
policy to navigate between the room-doors to get from one building gate to another. Let 
the set of all policies defined at this level be $\Pi_2$. Continuing in this way at the higher levels, 
we obtain the policy hierarchy $\mathcal{H} = (\Pi_0, \Pi_1, \Pi_2, \ldots, \Pi_K)$. A policy hierarchy constructed 
through State-space Region-based Decomposition is termed an SRD policy hierarchy. 

An SRD policy hierarchy has the property that the set of applicable states of all the 
policies at a given abstraction level forms a partition of the state space. Thus, from the state 
sequence $(s_0, \ldots, s_t, \ldots)$ resulting from the execution of the top level policy, we can infer 
the exact starting and terminating times of all intermediate-level policies. For example, at 
level $k$, the starting/stopping times of the policies in this level are the time indices $t$ at 
which the state sequence crosses over a region boundary: $s_{t-1} \in R_k$ and $s_t \notin R_k$ for some 
region $R_k$ of the partition $\mathcal{P}_k$. Later in Section 5.1, we will show that this property helps to 
simplify some of the complexity of the policy recognition problem. 
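
The boundary-crossing rule can be read off a state sequence mechanically. The sketch below is illustrative only: the nine integer states, the two partitions, and the function names `region_of` and `switch_times` are assumptions introduced for the example.

```python
def region_of(state, partition):
    """Index of the region in `partition` that contains `state`."""
    for index, region in enumerate(partition):
        if state in region:
            return index
    raise ValueError(f"{state!r} is not covered by the partition")


def switch_times(states, partition):
    """Time indices t at which s_{t-1} and s_t lie in different regions."""
    return [
        t
        for t in range(1, len(states))
        if region_of(states[t - 1], partition) != region_of(states[t], partition)
    ]


# Hypothetical state space 0..8 with a two-level region hierarchy.
rooms = [{0, 1, 2}, {3, 4, 5}, {6, 7, 8}]  # level-1 partition (finest)
wings = [{0, 1, 2, 3, 4, 5}, {6, 7, 8}]    # level-2 partition
trajectory = [0, 1, 2, 3, 2, 3, 4, 5, 6, 7]

level1 = switch_times(trajectory, rooms)
level2 = switch_times(trajectory, wings)
```

Because each level-2 region is a union of level-1 regions, every level-2 switching time is also a level-1 switching time, in line with the rule that a policy can end only when the policy below it ends.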

3.3 A Policy Hierarchy Example 

As an example, consider the task to monitor and predict the movement of an agent through 
a building shown in Fig. 5(a). Each room is represented by a 5 x 5 grid, and two adjacent 
rooms are connected via a door in the center of their common edge. The four entrances to 
the building are labeled north (N), west (W), south (S) and east (E). In addition, the door 
in the center of the building (C) acts like an entrance between the building's north wing 
and south wing. At each state (cell), the agent can move in 4 possible directions except 
when it is blocked by a wall. 

The policy hierarchy to model the agent's behaviour in this environment can be con- 
structed based on region-based decomposition at three levels of abstraction. Firstly, a region 
hierarchy is constructed. The partition of the environment consists of the 8 rooms at level 1, 
the two wings (north and south) at level 2, and the entire building at level 3. The behaviours 
of the agent at level 1 (within each room) are represented by a set of level-1 policies. For 
example, in each room, we use 4 level-1 policies to model the agent's behaviours of exiting 
the room via the 4 different doors. These are essentially four Markov chains within the room 
which terminate outside of the room. One way to represent these policies is to specify which 
movement action the agent should take given the current position and the current heading. 
At the higher level, the agent's behaviours within each wing are specified. For example, 
we use 3 level-2 policies in each wing to model the agent's behaviours of exiting the wing 
via the 3 wing exits. These policies are built on top of the set of level-1 policies already 
defined. They specify which level-1 policies the agent should take to leave the wing at the 
intended exit. Finally, at the top level, the agent's behaviours within the entire building 
can be specified. For example, we use 4 top-level policies to model the agent's behaviours 
of leaving the building via the four building exits N, W, S, E. A sample of these policies 
and their parameters is given in Fig. 5(b). 

Figure 5: An example policy hierarchy. (a) The environment. (b) Parameters of the AHMM: 
a level-1 policy (destination on the right), a level-2 policy (current state: W, destination: C), 
a level-3 policy (current state: W, destination: E), and the prior for the top-level policy 
(N: 0.25, S: 0.25, E: 0.25, W: 0.25). 



3.4 AMM as a Plan Execution Model 

Up to now, we have presented the AMM as a formal plan execution model to be used later 
in the plan recognition process. In this subsection, we discuss the expressiveness of the 
AMM as a formal plan specification language, and also the suitability of using the AMM 
to encode plans in the context of plan recognition. Note that the discussion here focuses on 
the representational aspect of the AMM alone. A discussion of the computational aspects 
of the AMM/AHMM in comparison with other works in probabilistic plan recognition will 
be presented in Section 7. 

The AMM is particularly well-suited for representing goal-directed behaviours at 
different levels of abstraction. Each policy in the AMM can be viewed as a plan trying to 
achieve a particular goal. However, unlike a classical plan, a policy specifies the course of 
actions at all applicable states, and is more similar to a contingent plan. The ending of a 
policy could mean either that the goal has been achieved, or that the attempt to achieve the 
goal using the current policy has failed. This interpretation of the persistence of a policy 
fits into the persistence model of intentions (Cohen & Levesque, 1990): when an intention 
ends, there is no guarantee that the intended goal has been achieved. Thus, conceptually, 
there are two types of destination states: one corresponds to the intended goal states, and 
the other corresponds to unintended failure states resulting from the stochastic nature of 
the execution of the plan. Due to its generality, the AMM does not need to distinguish 
between these two types; both the successful termination states and the unsuccessful ones are 
treated the same as possible destination states, albeit with different reaching probabilities.^4 



4. One would expect that an agent would be more likely to reach the intended destination state than a 
random failure state. 
Using the AMM as a model of plan execution thus allows us to blur the difference 
between planning and re-planning. At the same time, it moves from the recognition of 
a classical plan towards the recognition of the agent's intention. Most of the existing 
frameworks for probabilistic plan recognition do not explicitly represent the current state, 
and thus, the relationship between states and the adoption and termination of current plans 
is ignored (Goldman et al., 1999).^5 Thus, it would be impossible to tell if the current plan 
has failed and the new plan is an attempt to recover from this failure, or the current plan 
has succeeded and the new plan is part of a new higher level goal. 

A more expressive language for describing abstract probabilistic plans is the Hierarchical 
Abstract Machines (HAM) proposed in (Parr & Russell, 1997; Parr, 1998). In a HAM, the 
abstract policy is replaced by a stochastic finite automaton, which can call other machines 
at the lower level. Our abstract policies can be written down as machines of this type. Such 
a machine would choose one of the machines corresponding to the policies at the lower level 
and then go back to the start state after the called machine has terminated. The HAM 
framework allows for machines with an arbitrary finite number of machine states and transition 
probabilities,^6 and thus can readily represent more complex plans such as concatenation of 
policies, alternative policy paths, etc. It is possible to represent each machine in a HAM 
as a policy in our AMM, but at the cost of augmenting the state space to include 
the machine states of all the machines in the current call stack. Thus, the size of the 
AMM's new state space would be exponential in the number of nested levels 
in the HAM's call stack. While this shows that, in theory, the expressiveness of the HAM and 
of our policy hierarchy is the same, performing policy recognition on the HAM-equivalent policy 
hierarchy is probably unwise since the state space becomes exponentially large after the 
conversion. A better idea would be to represent the internal state of each machine as a 
variable in a DBN and perform inference on this DBN structure directly. 

The AMM is also closely related to a model for probabilistic plan recognition called the 
Probabilistic State-Dependent Grammar (PSDG), independently proposed in (Pynadath, 
1999; Pynadath & Wellman, 2000). The PSDG can be described as the Probabilistic 
Context Free Grammar (PCFG) (Jelinek, Lafferty, & Mercer, 1992), augmented with a 
state space, and a state transition probability table for each terminal symbol of the PCFG. 
In addition, the probability of each production rule is made state dependent. As a result, 
the terminal symbols now act like primitive actions and each non-terminal symbol chooses its 
expansion depending on the current state. Interestingly, the PSDG is directly related to the 
HAM language described above, similar to the way production-rule grammars are related 
to finite automata. Given a PSDG, we can convert it to an equivalent HAM by constructing 
a machine for each non-terminal symbol, and modelling the production rules for each 
non-terminal symbol by the automaton. 

Our policy hierarchy is equivalent to a special class of PSDG where only production 
rules of the form $X \to YX$ and $X \to \emptyset$ are allowed. The former rule models the adoption 
of a lower level policy $Y$ by a higher level policy $X$, while the latter models the termination 
of a policy $X$. The PSDG model considered in (Pynadath, 1999; Pynadath & Wellman, 
2000) allows for more general rules of the form $X \to Y_1 \ldots Y_m X$, i.e., the recursion symbol 
must be located at the end of the expansion. Thus in a PSDG, a policy might be expanded 
into a sequence of policies at the lower level which will be executed one after another before 
control is returned to the higher level policy. The implicit assumption here is that when a 
policy in the sequence terminates, it always does so at a state where the next policy in the 
sequence is applicable. Given this assumption, in the language of the AHMM we can define 
a compound policy $\pi^k$ as a policy that simply executes, in order, a sequence of policies at 
the lower level $\pi^{k-1}_{(1)}, \ldots, \pi^{k-1}_{(m)}$, independent of the current state. A PSDG is then equivalent 
to an AHMM if compound policies of this form are allowed. 

5. With the exceptions of (Goldman et al., 1999; Pynadath & Wellman, 2000), which will be discussed in 
detail in Section 7. 

6. With the constraint that there is no recursion in the calling stack to keep the stack finite. 

Since the AMM closely follows the models used in abstract probabilistic planning, it can 
be used to model and recognise the behaviours of any autonomous agent whose decision 
making process is equivalent to an abstract MDP. It is also useful as a formal language 
for specifying contingent plans whose execution can then be monitored using the policy 
recognition algorithm. The language is also rich enough to specify a range of useful human 
behaviours, especially in domains where there is a natural hierarchical decomposition of 
the state space. Section 6 presents an application of the AHMM framework to the problem 
of recognising people's behaviours in a complex spatial environment. Here, each policy of 
the AHMM represents the evolution of possible trajectories of a person's movement while the 
person performs a certain task in the environment such as heading towards a door, using 
the computer at a certain location, etc. The policies at different levels would represent the 
evolution of trajectories at different levels of abstraction. Due to the existing hierarchy in 
the domain, the policies can be constructed using the region-based decomposition of the 
state space. The environment is populated with multiple cameras divided into different 
zones that can provide the current location of the tracking target, albeit a noisy one. The 
noisy observations can be readily handled by the observation model in the AHMM. The 
policy recognition algorithm can then be applied to infer the person's current policy at 
different levels in the hierarchy. 

One main restriction of the current AHMM model is that we consider only one top-level 
policy at a time, and thus are unable to model the interleaving of concurrent plans. 
Another more subtle restriction is the assumption that a high level policy selects the lower 
level policies depending only on the current state. If the state space is interpreted as the 
states of the external environment, this assumption implies that the actor either has full 
observation about the current state, or at least refines its intentions based on the actor's 
observation about the current state only (and not the entire observation history). Note that 
these restrictions of the AHMM also apply in the case of the PSDG model. 

4. Dynamic Bayesian Network Representation 

In this section, we describe the Dynamic Bayesian Network (DBN) representation of the 
AHMM. The network serves two purposes: (1) as the tool to derive the probabilistic in- 
dependence property of this stochastic model, and (2) as the computational framework for 
the policy recognition algorithms in Section 5. 

4.1 Network Construction 

At time $t$, let $s_t$ represent the current state, $\pi_t^k$ represent the current policy at level $k$ 
($k = 0, \ldots, K$), and $e_t^k$ represent the ending status of $\pi_t^k$, i.e., a boolean variable indicating 
whether the policy $\pi_t^k$ terminates at the current time. These variables make up the 
current time-slice of the full DBN. For convenience, the notation $\pi_t^{all}$ refers to the set of 
all the current policies $\{\pi_t^0, \ldots, \pi_t^K\}$. Before presenting the full network, we first describe 
the two sub-structures that model how policies are terminated and selected. The full DBN 
can then be easily constructed from these sub-structures. 

Figure 6: Sub-network for policy termination 

4.1.1 Policy Termination 

From the definition of abstract policies, a level-$k$ policy $\pi_t^k$ terminates only if the lower level 
policy $\pi_t^{k-1}$ terminates, and if so, terminates with probability $\beta_{\pi_t^k}(s_t)$. In the Bayesian 
network representation, the terminating status $e_t^k$ therefore has three parent nodes: $\pi_t^k$, $s_t$, 
and $e_t^{k-1}$ (Fig. 6(a)). 

The parent variable $e_t^{k-1}$ however plays a special role. If $e_t^{k-1} = T$, meaning the lower 
level policy terminates at the current time, $\Pr(e_t^k = T \mid \pi_t^k, s_t) = \beta_{\pi_t^k}(s_t)$, which gives the 
conditional probability of $e_t^k$ given the other two parent variables (Fig. 6(b)). However, if 
$e_t^{k-1} = F$, $\pi_t^k$ should not terminate and so $e_t^k = F$. Therefore, given that $e_t^{k-1} = F$, $e_t^k$ is 
deterministically determined and is independent of the other two parent variables $\pi_t^k$ and 
$s_t$. Using the notion of context-specific independence (CSI) (Boutilier et al., 1996), we can 
then safely remove the links from the other two parents to $e_t^k$ in the context that $e_t^{k-1}$ is 
false (Fig. 6(c)). 

At the bottom level, since the primitive action always terminates immediately, $e_t^0 = T$ 
for all $t$. Since we are modelling the execution of a single top-level policy $\pi^K$, we can assume 
that the top-level policy does not terminate and remains unchanged: $e_t^K = F$ and $\pi_t^K = \pi^K$ 
for all $t$. Also, note that $e_t^l = T \Rightarrow e_t^k = T$ for all $k < l$, and $e_t^l = F \Rightarrow e_t^k = F$ for all $k > l$. 
Thus, at each time $t$, there exists $0 \leq l_t < K$ such that $e_t^k = T$ for all $k \leq l_t$, and $e_t^k = F$ 
for all $k > l_t$. The variable $l_t$ is termed the highest level of termination at time $t$. Knowing 
the value of $l_t$ is equivalent to knowing the terminating status of all the current policies. 
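
The termination rules above amount to a bottom-up sampling pass over the levels. The following sketch is one possible reading under stated assumptions: the function name `sample_ending_status`, the callable `beta(policy, state)`, the four hypothetical policy names and their stopping probabilities are all introduced here for illustration, and the hierarchy is assumed to have at least one level above the primitive actions.

```python
import random


def sample_ending_status(policies, state, beta, rng):
    """Sample e^0..e^K bottom-up for the current time slice.

    policies[k] is the current level-k policy (policies[0] is the primitive
    action) and beta(policy, state) is its stopping probability.
    Returns (e, l_t), where l_t is the highest level of termination.
    """
    K = len(policies) - 1
    e = [True]  # e^0 = T: a primitive action always ends
    for k in range(1, K):
        # e^k can be T only if e^{k-1} = T, and then with probability beta.
        e.append(e[k - 1] and rng.random() < beta(policies[k], state))
    e.append(False)  # e^K = F: the top-level policy persists
    l_t = max(k for k, ended in enumerate(e) if ended)
    return e, l_t


# Hypothetical three-level example with extreme stopping probabilities.
stop = {"to_door": 1.0, "to_exit": 0.0}


def beta(policy, state):
    """Illustrative stopping probability that ignores the state."""
    return stop[policy]


current = ["right", "to_door", "to_exit", "leave_building"]
e, l_t = sample_ending_status(current, "doorway", beta, random.Random(0))
```

The returned list is always a block of `True` values followed by a block of `False` values, so the single integer `l_t` carries the same information as the whole list.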

4.1.2 Policy Selection 

The current policy $\pi_t^k$ in general is dependent on the higher level policy $\pi_t^{k+1}$, the previous 
state $s_{t-1}$, the previous policy at the same level $\pi_{t-1}^k$ and its ending status $e_{t-1}^k$. In the 
Bayesian network, $\pi_t^k$ thus has these four variables as its parents (Fig. 7(a)). This 
dependency can be further broken down into two cases, depending on the value of the parent 
node $e_{t-1}^k$. 

Figure 7: Sub-network for policy selection 

If the previous policy has not terminated ($e_{t-1}^k = F$), the current policy is the same as 
the previous one: $\pi_t^k = \pi_{t-1}^k$, and the variable $\pi_t^k$ is thus independent of $\pi_t^{k+1}$ and $s_{t-1}$. 
Therefore, in the context $e_{t-1}^k = F$, the two links from $\pi_t^{k+1}$ and $s_{t-1}$ to the current policy 
can be removed, and the two nodes $\pi_t^k$ and $\pi_{t-1}^k$ can be merged together (Fig. 7(b)). 

If the previous policy has terminated ($e_{t-1}^k = T$), the current policy is selected by the 
higher level policy with probability $\Pr(\pi_t^k \mid \pi_t^{k+1}, s_{t-1}) = \sigma_{\pi_t^{k+1}}(s_{t-1}, \pi_t^k)$. In this context, $\pi_t^k$ 
is independent of $\pi_{t-1}^k$ and the corresponding link in the Bayesian network can be removed 
(Fig. 7(c)). 
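
The two contexts translate into a top-down pass that either copies a policy forward or re-draws it from its parent. The sketch below is a hedged illustration rather than the paper's algorithm: `select_policies`, the callable `sigma(parent, state)`, and the policy names in the selection table are hypothetical.

```python
import random


def select_policies(prev_policies, prev_ending, prev_state, sigma, rng):
    """Sample the current policies pi^K..pi^0 from the previous time slice.

    prev_ending[k] is the ending status of the level-k policy at time t-1.
    sigma(parent, state) returns a {child: probability} dictionary giving
    the parent's selection function at that state.
    """
    K = len(prev_policies) - 1
    current = list(prev_policies)  # the top-level policy is never re-selected
    for k in range(K - 1, -1, -1):
        if prev_ending[k]:
            # Context e = T: draw from the parent's selection function,
            # using the already updated parent and the previous state.
            dist = sigma(current[k + 1], prev_state)
            children = list(dist)
            weights = [dist[c] for c in children]
            current[k] = rng.choices(children, weights=weights, k=1)[0]
        # Context e = F: the policy persists, so current[k] is left unchanged.
    return current


# Hypothetical deterministic selection table, for illustration only.
table = {
    "leave": {"to_door_B": 1.0},
    "to_door_A": {"left": 1.0},
    "to_door_B": {"right": 1.0},
}


def sigma(parent, state):
    """Illustrative selection function that ignores the state."""
    return table[parent]


previous = ["left", "to_door_A", "leave"]
both_ended = select_policies(previous, [True, True, False], "cell", sigma,
                             random.Random(0))
action_only = select_policies(previous, [True, False, False], "cell", sigma,
                              random.Random(0))
```

In the first call the level-1 policy has ended, so both it and the primitive action are re-drawn; in the second only the primitive action is re-drawn while `to_door_A` persists.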

4.1.3 The Full DBN 

The full dynamic Bayesian network can be constructed for all the policy, ending status, and 
state variables by putting the sub-networks for policy termination and selection together 
(Fig. 8). At the top level, since $e_t^K = F$, we can remove the ending status nodes and merge 
all the $\pi_t^K$ into a single node $\pi^K$. At the base level, since $e_t^0 = T$, we can remove the ending 
status nodes and also the links from $\pi_t^0$ to $\pi_{t+1}^0$. To model the observation of the hidden 
states, an observation layer can be attached to the state layer as shown in Fig. 8. 

Suppose that we are given a context where each of the variable is known. We can 
then modify the full DBN using the corresponding link removal and node merging rules. 
The result is a more intuitive tree-shaped network in Fig. 9, where all the policy nodes 
corresponding to the same policy for its entire duration are grouped into one. The grouping 
can be done since knowing the value of each is equivalent to knowing the exact duration 
of each policy in the hierarchy. One would expect that performing probabilistic inference 
on this structure is more simple than that of the full DBN in Fig. 8. In particular, if the 
state sequence is known, the remainder of the network in Fig. 9 becomes singly-connected, 
i.e., a directed graph with no undirected cycles, allowing inference to be performed with 
complexity linear to the size of the network (Pearl, 1988). The policy recognition algorithms 
that follow later exploit extensively this particular tractable case of the AHMM. 






Bui, Venkatesh & West 








4.2 Conditional Independence in the Current Time-Slice 

The above discussion identifies a tractable case for the AHMM, but it requires the knowledge
of the entire history of the state and the policy ending status variables. In this subsection,
we focus on the conditional independence property of the nodes in the current time-slice:
$s_t, \pi_t^0, \ldots, \pi_t^K$. Since these nodes will make up the belief state of any future inference
algorithm for our AHMM, any independence properties among these variables, if exploited,
can provide a more compact representation of the belief state and reduce the inference
complexity.

Due to the way policies are invoked in the AMM, we can make the intuitive remark
that the higher level policies can only influence what happens at the lower levels through the
current level. More precisely, for a level-$k$ policy $\pi_t^k$, if we know its starting state, the course
of its execution is fully determined, where being determined here means without influence
from what is happening at the higher levels. Furthermore, if we also know how long the
policy has been executed, or equivalently its starting time, the current state of its execution
is also determined. Thus, the higher level policies can only influence the current state of
execution of $\pi_t^k$ either through its starting state or starting time. In other words, if we know
$\pi_t^k$ together with its starting time and starting state, then the current higher level policies
are completely independent of the current lower level policies and the current state.
Theorem 1 below states this in a precise form. Note that the condition obtained
is the strictest: if one of the three conditioning variables is unknown, there are examples of
AMMs in which the higher level policies can influence the lower level ones.

Theorem 1. Let $\tau_t^k$ and $b_t^k$ be two random variables representing the starting time and the
starting state, respectively, of the current level-$k$ policy $\pi_t^k$: $\tau_t^k = \max\{t' < t \mid e_{t'}^k = T\}$ and
$b_t^k = s_{\tau_t^k}$. Let $\pi_t^{>k} = \{\pi_t^{k+1}, \ldots, \pi_t^K\}$ denote the set of current policies from level $k + 1$ up
to $K$, and $\pi_t^{<k} = \{s_t, \pi_t^0, \ldots, \pi_t^{k-1}\}$ denote the set of current policies from level $k - 1$ down
to $0$, together with the current state. We have:

$\pi_t^{>k} \perp \pi_t^{<k} \mid \pi_t^k, b_t^k, \tau_t^k$ (7)

Proof. We sketch here an intuitive proof of this theorem through the use of the Bayesian
network manipulation rules for context-specific independence (CSI) which have been discussed in
4.1.1 and 4.1.2. An alternative proof that does not use CSI can be found in (Bui et al.,
2000).

We first note that the theorem is not obvious by looking at the full DBN in Fig. 8.
Therefore, we shall proceed by modifying the network structure in the context that we
know $\tau_t^k$.

At time $\tau_t^k$, all the policies at level $k$ and below must terminate: $e_{\tau_t^k}^l = T$ for all $l \le k$.
Thus we can remove all the links from these policies to the new policies at time $\tau_t^k + 1$.

On the other hand, from time $\tau_t^k + 1$ until the current time $t$, all the policies at level $k$
and above must not terminate: $e_{t'}^l = F$ for all $l \ge k$, $\tau_t^k + 1 \le t' < t$. Thus we can group all
the policies at level $l \ge k$ between time $\tau_t^k + 1$ and $t$ into one node representing the current
policy at level $l$.

These two network manipulation steps result in a network with the structure shown in
Fig. 10. Once the modified network structure is obtained, we can observe that $\pi_t^{>k}$ and
$\pi_t^{<k}$ are d-separated by $\pi_t^k$ and $b_t^k$ in the new structure. Thus $\pi_t^{>k}$ and $\pi_t^{<k}$ are independent
given $\pi_t^k$, $b_t^k$ and $\tau_t^k$. □

Figure 10: Network structure after being conditioned on $\tau_t^k$

5. Policy Recognition 

In this section we begin to address the problem of policy recognition in the framework of
the AHMM. We assume that a policy hierarchy is given and is modelled by an AHMM,
but that the top level policy and the details of its execution are unknown. The problem is
then to determine the top level policy and the other current policies at the lower levels given
the current sequence of observations. In more concrete terms, we are interested in the
conditional probability:

$\Pr(\pi_t^K, \ldots, \pi_t^0 \mid \tilde{o}_{t-1})$

and especially, the marginals:

$\Pr(\pi_t^k \mid \tilde{o}_{t-1})$, for all levels $k$

Computing these probabilities gives us the information about the current policies at all
levels of abstraction, from the current action ($k = 0$) to the top-level policy ($k = K$),
taking into account all the observations that we have to date.

In typical monitoring situations, these probabilities need to be computed "online", as
each new observation becomes available. To do this, it is required to update the belief
state (filtering distribution) of the AHMM at each time point $t$. This problem is generally
intractable unless the belief state has an efficient representation that affords a closed form
update procedure. In our case, the belief state is a joint distribution of $K + 3$ discrete
variables: $\Pr(\pi_t^K, \ldots, \pi_t^0, s_t, l_t \mid \tilde{o}_{t-1})$. Without any further structure imposed on the belief
state, the complexity for updating it is exponential in $K$.

To cope with this complexity, one generally has to resort to some form of approximation
to trade off accuracy for computational resources. On the other hand, the analysis of the
AHMM network in the previous section suggests that the problem of inference in the AHMM
can be tractable in the special case when the history of the state and terminating status
variables is known. Motivated by this property of the AHMM, our main aim in this section
is to derive a hybrid inference scheme that combines both approximation and tractable
exact inference for efficiency. We first treat the special case of policy recognition where
the belief state of the AHMM has a tractable structure in 5.1. We then present a hybrid
inference scheme for the general case using the Rao-Blackwellised Sequential Importance
Sampling (RB-SIS) method in 5.2.

5.1 Policy Recognition: the Tractable Case 

Here, we address the policy recognition problem under two assumptions: (1) the state
sequence can be observed with certainty, and (2) the exact time when each policy starts
and ends is known. More precisely, our observation at time $t$ includes the state history
$\tilde{s}_t = (s_0, \ldots, s_t)$ and the policy termination history $\tilde{l}_t = (l_0, \ldots, l_t)$. The belief state that
we need to compute in this case is $B_t = \Pr(\pi_t^{all}, s_t, l_t \mid \tilde{s}_{t-1}, \tilde{l}_{t-1})$ and its posterior after
absorbing the observation at time $t$: $B_{t+} = \Pr(\pi_t^{all} \mid \tilde{s}_t, \tilde{l}_t)$, where $\pi_t^{all}$ denotes the set of
all the current policies.

The first assumption means that the observer always knows the true current state and is
often referred to as "full observability". When the states are fully observable, we can ignore
the observation layer $\{o_t\}$ in the AHMM and thus only have to deal with the AMM instead.
The second assumption means that the observer is fully aware when the current policy 
ends and a new policy begins. If the policy hierarchy is constructed from the region-based 
decomposition of the state space (subsection 3.2), the termination status can be inferred 
directly from the state sequence. Thus for SRD policy hierarchies, only the full observability 
condition is needed since the second assumption is subsumed by the first and can be left 
out. Except for SRD policy hierarchies, these two assumptions are usually too restrictive 
for the policy recognition algorithm presented here to be useful by itself. However, the 
algorithm for this special case will form the exact step in the hybrid algorithm presented in 
subsection 5.2 for the general case. 

5.1.1 Representation of the belief state 

We first look at the conditional joint distribution $\Pr(\pi_t^{all}, s_t \mid \tilde{s}_{t-1}, \tilde{l}_{t-1})$. From the
termination history $\tilde{l}_{t-1}$, we can derive precisely the starting time of the current level-$k$ policy:

$\tau_t^k = \max\left(\{0\} \cup \{t' < t \mid e_{t'}^k = T\}\right) = \max\left(\{0\} \cup \{t' < t \mid l_{t'} \ge k\}\right)$
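
As a minimal sketch of this derivation, the starting time and starting state can be read directly off the two histories. The function names and the list encoding of the histories are assumptions for illustration; the convention that the level-$k$ policy terminates at time $t'$ exactly when $l_{t'} \ge k$ follows the definition of the highest level of termination.

```python
def starting_time(term_history, k):
    """tau_t^k = max({0} U {t' < t : l_{t'} >= k}).

    term_history[t'] holds l_{t'} for t' = 0, ..., t-1 (list encoding assumed).
    """
    times = [tp for tp, level in enumerate(term_history) if level >= k]
    return max([0] + times)


def starting_state(state_history, term_history, k):
    """b_t^k: the state recorded at the starting time of the level-k policy."""
    return state_history[starting_time(term_history, k)]
```

With the termination history (0, 2, 0, 1, 0), the level-1 policy has starting time 3, the level-2 policy has starting time 1, and any level above 2 falls back to time 0.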

On the other hand, knowing the starting time together with the state history also gives
us the starting state $b_t^k$. Thus, both the starting time and the starting state of $\pi_t^k$ can be
derived from $\tilde{s}_{t-1}$ and $\tilde{l}_{t-1}$. From Theorem 1, we obtain for all levels $k$:

$\pi_t^{>k} \perp \pi_t^{<k} \mid \pi_t^k, \tilde{s}_{t-1}, \tilde{l}_{t-1}$

In other words, given $\tilde{s}_{t-1}$ and $\tilde{l}_{t-1}$, the conditional joint distribution of $\{\pi_t^K, \ldots, \pi_t^0, s_t\}$
can be represented by a Bayesian network with a simple chain structure. We denote this
chain network by $C_t = \Pr(\pi_t^{all}, s_t \mid \tilde{s}_{t-1}, \tilde{l}_{t-1})$ and term it the belief chain for the role it plays
in the representation of the belief state (Fig. 11(a)). If a chain is drawn so that all links
point away from the level-$k$ node, we say that the chain has root at level $k$. The root of the
chain can be moved from $k$ to another level $k'$ simply by reversing the links lying between
$k$ and $k'$ using the standard link-reversal operation for Bayesian networks (Shachter, 1986).

Figure 11: Representation of the belief state, panels (a) and (b)
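
For a chain, reversing the link that leaves the current root amounts to one application of Bayes' rule. The sketch below shows this for a two-node fragment A -> B; the function name and the dictionary encoding are assumptions for illustration, and the general arc-reversal operation of Shachter (1986), which also handles shared parents, is not reproduced here.

```python
def reverse_link(prior_a, cond_b_given_a):
    """Reverse A -> B into B -> A for a root node A.

    prior_a:         {a: P(a)}
    cond_b_given_a:  {a: {b: P(b | a)}}
    Returns (marginal_b, cond_a_given_b), i.e. P(b) and P(a | b).
    """
    # New root marginal: P(b) = sum_a P(b | a) P(a)
    marginal_b = {}
    for a, pa in prior_a.items():
        for b, pba in cond_b_given_a[a].items():
            marginal_b[b] = marginal_b.get(b, 0.0) + pa * pba
    # Reversed link: P(a | b) = P(b | a) P(a) / P(b)
    cond_a_given_b = {}
    for b, pb in marginal_b.items():
        if pb == 0.0:
            continue  # b is impossible, so no conditional is needed
        cond_a_given_b[b] = {
            a: prior_a[a] * cond_b_given_a[a].get(b, 0.0) / pb for a in prior_a
        }
    return marginal_b, cond_a_given_b
```

Moving the root across several levels repeats this step once per link, so the cost is proportional to the number of links crossed.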

Each node in the belief chain also has a manageable size. In principle, the domain of
$\pi_t^k$ is $\Pi^k$, the set of all policies at level $k$, and the domain of $s_t$ is $S$, the set of all possible
states. When $K$ is large, we basically want to model a larger state space, and the set
of policies to cover this state space is also large. The sizes of these domains would most
likely grow exponentially with $K$. However, given a particular state, the number of policies
applicable at that state would remain relatively constant and independent of $K$. For each
policy $\pi_t^k$, we know its starting state $b_t^k$, which implies that $\pi_t^k \in \Pi^k(b_t^k)$, the set of all
level-$k$ policies applicable at $b_t^k$. Thus $\Pi^k(b_t^k)$ can be used as the "local" domain for $\pi_t^k$
to avoid the exponential dependency on $K$. Similarly, the domain for $s_t$ can be taken as
the set of neighbouring states of $s_{t-1}$ (reachable from $s_{t-1}$ by performing one primitive
action). For a given state, we term the maximum number of relevant objects (applicable
policies/actions, neighbouring states) at a single level the degree of connectivity $\mathcal{N}$ of the
domain being modelled. The size of the conditional probability table for each link of the
belief chain is then $O(\mathcal{N}^2)$, and the overall size of the belief chain is $O(K\mathcal{N}^2)$.

We can now construct the belief state $B_t$ from $C_t$. Since the current terminating status
is solely determined by the current policies and the current state, the belief state $B_t$ can be
factorised into:

$\Pr(\pi_t^{all}, s_t, l_t \mid \tilde{s}_{t-1}, \tilde{l}_{t-1}) = \Pr(l_t \mid \pi_t^{all}, s_t) \Pr(\pi_t^{all}, s_t \mid \tilde{s}_{t-1}, \tilde{l}_{t-1}) = \Pr(l_t \mid \pi_t^{all}, s_t)\, C_t$

Note that the variable $l_t$ is equivalent to the set of variables $\{e_t^K, \ldots, e_t^1\}$. Thus, the full
belief state $B_t$ can be realised by adding to $C_t$ the links from the current policies and the
current state to the terminating status variables (Fig. 11(b)). The size of the belief state
would still be $O(K\mathcal{N}^2)$. If the state is a composite of many orthogonal variables, a factored
representation can be used so that the size of the belief state representation does not depend
exponentially on the dimensionality of the state space. We discuss factored representations
further under subsection 5.2.2.

Figure 12: Belief state updating: from $B_t$ to $B_{t+}$, panels (a) and (b)

5.1.2 Updating the belief state 

Since the belief state $B_t$ can be represented by a simple belief network in Fig. 11(b), we can
expect that a general exact inference method for updating the belief state such as (Kjaerulff,
1995) will work efficiently. However, this general method works with an undirected network
representation of the belief state distribution, which can be inconvenient for us later on when
we want to sample from such a distribution. Here, we describe an algorithm that updates
the belief state in the closed form given by the directed network in Fig. 11(b).

Assuming that we have a complete specification of the belief state $B_t$, i.e., all the
parameters for its Bayesian network representation, we need to compute the parameters for
the new network $B_{t+1}$. This is done in two steps, as in the standard "roll-over" of the belief
state of a DBN: (1) absorbing the new evidence $s_t$, $l_t$ and (2) projecting the belief state
into the next time step.

The first step corresponds to the instantiation of the variables $s_t, e_t^1, \ldots, e_t^K$ in the
Bayesian network $B_t$ to obtain $B_{t+}$, which is the conditional joint distribution of $\pi_t^K, \ldots, \pi_t^0$.
By checking the conditional independence relationships in Fig. 11(b), it is easy to see that
$B_{t+}$ again has a simple chain network structure. Thus, conceptually, the problem here is
to update the parameters of the chain $C_t$ so as to absorb the given evidence to form a new
chain $B_{t+}$. This can be done by a number of link-reversal steps as follows.

To instantiate $s_t$, we first move the root of the chain $C_t$ to $s_t$. The variable $s_t$ then has
no parents and can be instantiated and deleted from the network (Fig. 12(a)).

To instantiate $l_t$, which is equivalent to the value assignment $(e_t^K = F, \ldots, e_t^{l_t+1} = F, e_t^{l_t} = T, \ldots, e_t^1 = T)$,
starting from $k = 1$, we iteratively reverse the links from $\pi_t^{k-1}$ to $\pi_t^k$ and
from $\pi_t^k$ to $e_t^k$ (Fig. 12(b)). In algebraic form, the first link reversal operation corresponds
to computing the following probabilities:

$\Pr(\pi_t^k \mid s_t, e_t^1, \ldots, e_t^{k-1}) = \sum_{\pi_t^{k-1}} \Pr(\pi_t^k \mid \pi_t^{k-1}) \Pr(\pi_t^{k-1} \mid s_t, e_t^1, \ldots, e_t^{k-1})$ (8)

$\Pr(\pi_t^{k-1} \mid \pi_t^k, s_t, e_t^1, \ldots, e_t^{k-1}) \propto \Pr(\pi_t^k \mid \pi_t^{k-1}) \Pr(\pi_t^{k-1} \mid s_t, e_t^1, \ldots, e_t^{k-1})$ (9)

and the second link reversal corresponds to:

$\Pr(\pi_t^k \mid s_t, e_t^1, \ldots, e_t^k) \propto \Pr(e_t^k \mid \pi_t^k, s_t, e_t^{k-1}) \Pr(\pi_t^k \mid s_t, e_t^1, \ldots, e_t^{k-1})$ (10)

Effectively, the $k$-th link reversal step positions the root of the chain $C_t$ at $\pi_t^k$ and absorbs
the evidence $e_t^k$. By repeating this link reversal operation with $k = 1, \ldots, l_t + 1$, we obtain
a new chain for $B_{t+}$ which has root at level $l_t + 1$. Note that there is no need to incorporate
the instantiations $e_t^k = F$ for $k > l_t + 1$ since they are the direct consequences of the
instantiation $e_t^{l_t+1} = F$. The parameters of the chain $B_{t+}$ are given below. The upward
links remain the same as those of $C_t$, while the marginal at level $l_t + 1$ and the downward
links are obtained as the results of the link reversal operations above:

$\Pr(\pi_t^{k+1} \mid \pi_t^k, \tilde{s}_t, \tilde{l}_t) = \Pr(\pi_t^{k+1} \mid \pi_t^k)$, $k \ge l_t + 1$

$\Pr(\pi_t^k \mid \tilde{s}_t, \tilde{l}_t) = \Pr(\pi_t^k \mid s_t, e_t^1, \ldots, e_t^k)$, $k = l_t + 1$

$\Pr(\pi_t^{k-1} \mid \pi_t^k, \tilde{s}_t, \tilde{l}_t) = \Pr(\pi_t^{k-1} \mid \pi_t^k, s_t, e_t^1, \ldots, e_t^{k-1})$, $k \le l_t + 1$

In the second step, we continue to compute $C_{t+1}$ from $B_{t+}$. Since all the policies at
levels higher than $l_t$ do not terminate, $\pi_{t+1}^{>l_t} = \pi_t^{>l_t}$, and we can retain this upper sub-chain
from $B_{t+}$ to $C_{t+1}$. In the lower part, for $k \le l_t$, a new policy $\pi_{t+1}^k$ is created by the policy
$\pi_{t+1}^{k+1}$ at the state $s_t$, and thus a new sub-chain can be formed among the variables $\pi_{t+1}^k$, $k \le l_t + 1$, with
parameters $\Pr(\pi_{t+1}^k \mid \pi_{t+1}^{k+1}, s_t) = \sigma_{\pi_{t+1}^{k+1}}(s_t, \pi_{t+1}^k)$. Note that the domain of the newly-created
node $\pi_{t+1}^k$ is $\Pi^k(s_t)$. The new chain $C_{t+1}$ is then the combination of these two sub-chains,
which will be a chain with root at level $l_t + 1$ (see Fig. 13). Once we have the chain $C_{t+1}$,
the new belief state $B_{t+1}$ can be obtained by simply adding the terminating status variables
$\{e_{t+1}^k\}$ to $C_{t+1}$.

This completes the procedure for updating the belief state from $B_t$ to $B_{t+1}$, thus allowing
us to compute the belief state $B_t$ at each time step. Although the belief state is the joint
distribution of all the current variables, due to its simple structure, the marginal distribution
of a single variable can be computed easily. For example, if we are only interested in the
current level-$k$ policy $\pi_t^k$, the marginal probability $\Pr(\pi_t^k \mid \tilde{s}_{t-1}, \tilde{l}_{t-1})$ is simply the marginal
at the level-$k$ node in the chain $C_t$, and can be readily obtained from the chain parameters.

The complexity of the belief state updating procedure at time $t$ is proportional to $l_t$
since it only needs to modify the bottom $l_t$ levels of the belief state. On the other hand, the
probability that the current policy at level $l$ terminates can be assumed to be exponentially
small w.r.t. $l$. Thus, the average updating complexity at each time-step is $O(\sum_l l/\exp(l))$,
which is bounded by a constant and thus does not depend on the number of levels in the policy
hierarchy. In terms of the number of policies and states, the updating complexity is linear
in the size of a policy node in the belief chain, and thus linear in the degree of connectivity
of the domain.
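
The constant bound can be checked numerically. The sketch below assumes, purely for illustration, that the level-$l$ policy terminates with probability exactly $\exp(-l)$; the text only requires the probability to be exponentially small, so the constants here are not taken from the paper. Under that assumption the series has the closed form $\sum_{l \ge 1} l e^{-l} = e/(e-1)^2$.

```python
import math


def expected_update_cost(K):
    """Partial sum of l * exp(-l) for l = 1..K (illustrative cost model)."""
    return sum(level * math.exp(-level) for level in range(1, K + 1))


# Closed form of the infinite series: x / (1 - x)^2 evaluated at x = 1/e.
LIMIT = math.e / (math.e - 1) ** 2

for depth in (2, 5, 10, 100):
    print(depth, round(expected_update_cost(depth), 6), round(LIMIT, 6))
```

The partial sums increase with the depth of the hierarchy but never exceed the limit, which is below 1, so adding levels does not raise the average cost per time step in this model.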
Figure 13: Belief state updating: from $B_{t+}$ to $C_{t+1}$



5.2 Policy Recognition: The General Case 

We now return to the general case of policy recognition, i.e., without the two assumptions of
the previous subsection. This makes the inference tasks in the AHMM much more difficult.
Since neither the starting times nor the starting states of the current policies are known
with certainty, Theorem 1 cannot be used. Thus, the set of current policies no longer forms
a chain structure as it did in $C_t$, since the conditional independence properties of the current
time-slice no longer hold. We therefore cannot hope to represent the belief state by a simple
structure as we did previously. An exact method for updating the belief state will thus have
to operate on a structure with size exponential in $K$, and is bound to be intractable when
$K$ is large.

To cope with this complexity, an approximation scheme such as sequential importance
sampling (SIS) (Doucet et al., 2000b; Liu & Chen, 1998; Kanazawa et al., 1995) can be
employed. In our previous work (Bui, Venkatesh, & West, 1999), we have applied an SIS
method known as likelihood weighting with evidence reversal (LW-ER) (Kanazawa et al.,
1995) to an AHMM-like network structure. However, the SIS method needs to sample in the
product space of all the layers of the AHMM and thus becomes less accurate and less efficient
with large $K$. The key to getting around this inefficiency is to utilise the special structure of
the AHMM, particularly its special tractable case, to keep the set of variables that need to
be sampled to a minimum.

The improvement of the SIS method that achieves this has been presented in
subsection 2.5 under the name of the Rao-Blackwellised SIS (RB-SIS) method. Rao-Blackwellisation
specifically allows the marginalisation of some variables analytically and only samples the
remaining variables. As a result, this reduces the averaged error, measured as the variance
of the estimator (Casella & Robert, 1996).

In order to apply RB-SIS to the AHMM, the main problem is to identify which variables
should be used as the Rao-Blackwellising variables and should still be sampled, with
the remaining variables being marginalised analytically. The key to choosing the
Rao-Blackwellising variables, as we have shown in 2.5, is that if those variables can be observed,
the Rao-Blackwellised belief state becomes tractable. In subsection 5.1, we have demonstrated
that if the state history $\tilde{s}_t$ and the terminating status history $\tilde{l}_t$ can be observed,
then the belief state has a simple network structure and can be updated with constant average
complexity. Thus, $(s_t, l_t)$ can be used conveniently as the Rao-Blackwellising variable
$r_t$. Note that the variables $l_t$ are the context variables which help to simplify the network
structure of the AHMM, while the state variables $s_t$ help to make the remaining network
singly-connected so that exact inference can operate efficiently (see subsection 4.1.3).
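
The variance reduction can be seen on a toy model that is unrelated to the AHMM itself. In the sketch below, all names and probabilities are illustrative assumptions: a binary variable `r` plays the role of the sampled Rao-Blackwellising variable, a binary variable `z` plays the role of the marginalised variable, and the quantity being estimated is Pr(z = 1).

```python
import random
import statistics

P_R = 0.5                        # Pr(r = 1), illustrative value
P_Z_GIVEN_R = {0: 0.2, 1: 0.8}   # Pr(z = 1 | r), illustrative values


def one_estimate(rng, n, rao_blackwellised):
    """Average of n samples, with or without marginalising z analytically."""
    total = 0.0
    for _ in range(n):
        r = 1 if rng.random() < P_R else 0
        if rao_blackwellised:
            total += P_Z_GIVEN_R[r]   # E[z | r] is computed exactly
        else:
            z = 1 if rng.random() < P_Z_GIVEN_R[r] else 0
            total += z                # z is sampled as well
    return total / n


def estimator_variance(rao_blackwellised, trials=300, n=100, seed=0):
    """Empirical variance of the estimator over repeated independent runs."""
    rng = random.Random(seed)
    runs = [one_estimate(rng, n, rao_blackwellised) for _ in range(trials)]
    return statistics.pvariance(runs)
```

Both estimators target the same value, 0.5, but the per-sample variance drops from 0.25 when `z` is sampled to 0.09 when only `r` is sampled, so comparing `estimator_variance(True)` with `estimator_variance(False)` shows the Rao-Blackwellised version to be tighter.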

5.2.1 RB-SIS for AHMM

We now discuss the specific application of RB-SIS to the problem of belief state updating
and policy recognition in the AHMM. Our main objective is to use RB-SIS to estimate the
conditional probability of the policy currently being executed at level $k$ given the current
sequence of observations, $\Pr(\pi_{t+1}^k \mid \tilde{o}_t)$.

Mapping the RB-SIS general framework in subsection 2.5 to the AHMM structure, the
set of all current variables $x_t$ is now the set of current policies, terminating status nodes,
and the current state: $x_t = (\pi_t^{all}, s_t, l_t)$. The probability under estimation $\Pr(\pi_{t+1}^k \mid \tilde{o}_t)$ can
be viewed as an expectation by letting $f(\pi_t^{all}, s_t, l_t) = \Pr(\pi_{t+1}^k \mid \pi_t^{all}, s_t, l_t)$ so that:

$\bar{f} = \sum_{\pi_t^{all}, s_t, l_t} \Pr(\pi_{t+1}^k \mid \pi_t^{all}, s_t, l_t) \Pr(\pi_t^{all}, s_t, l_t \mid \tilde{o}_t) = \Pr(\pi_{t+1}^k \mid \tilde{o}_t)$

Using RB-SIS to estimate this expectation, we shall split $x_t$ into two sets of variables:
the set of RB variables $r_t = (s_t, l_t)$, and the set of remaining variables $z_t = \pi_t^{all}$, which is the
set of all the current policies. The functional $h$, which depends only on the RB variables
and is obtained from $f$ by integrating out the remaining variables (Eq. (5)), now has the
form:

$h(\tilde{r}_t) = h(\tilde{s}_t, \tilde{l}_t) = \sum_{\pi_t^{all}} \Pr(\pi_{t+1}^k \mid \pi_t^{all}, s_t, l_t) \Pr(\pi_t^{all} \mid \tilde{s}_t, \tilde{l}_t) = \Pr(\pi_{t+1}^k \mid \tilde{s}_t, \tilde{l}_t)$ (11)

which is the marginal $C_{t+1}(\pi_{t+1}^k)$ from the belief chain at time $t + 1$.

The RB belief state, which is the belief state of the AHMM when the RB variables are
known, becomes:

$\mathcal{B}_t = \Pr(\pi_t^{all}, s_t, l_t, o_t \mid \tilde{s}_{t-1}, \tilde{l}_{t-1}, \tilde{o}_{t-1}) = \Pr(\pi_t^{all}, s_t, l_t, o_t \mid \tilde{s}_{t-1}, \tilde{l}_{t-1})$ (12)

and is identical to the special belief state $B_t$ discussed in subsection 5.1, except for a minor
modification to attach the observation variable $o_t$.

From (11) and (12), both the $h$ function and the RB belief state can be computed
very efficiently using the exact inference techniques described in 5.1. Thus RB-SIS can be
implemented efficiently with minimal overhead in exact inference.

The main RB-SIS algorithm for the AHMM is given in Fig. 14. Note that we only need
to sample the RB variables $s_t$ and $l_t$. For each sample $i$, in addition to the weights $w^{(i)}$,
we also maintain a parametric representation of the Rao-Blackwellised belief state $\mathcal{B}_t^{(i)}$, and
the value of the $h$ function for that sample $h^{(i)}$. The weights of the samples, together with
the values of the $h$ function, can then be combined to yield an approximation for $\bar{f}$.

    Begin
      For t = 0, 1, ...
        For each sample i = 1, ..., N
          Sample $s_t^{(i)}, l_t^{(i)}$ from $\mathcal{B}_t^{(i)}(s_t, l_t \mid o_t)$
          Update weight $w_t^{(i)} = w_{t-1}^{(i)} \mathcal{B}_t^{(i)}(o_t)$
          Compute the posterior RB belief state $\mathcal{B}_{t+}^{(i)} = \mathcal{B}_t^{(i)}(\pi_t^{all} \mid s_t^{(i)}, l_t^{(i)}, o_t)$
          Compute the belief chain $C_{t+1}^{(i)}$ from $\mathcal{B}_{t+}^{(i)}$
          Compute the new belief state $\mathcal{B}_{t+1}^{(i)}$ from $C_{t+1}^{(i)}$
          Compute $h^{(i)} = C_{t+1}^{(i)}(\pi_{t+1}^k)$
        Compute the estimator $\Pr(\pi_{t+1}^k \mid \tilde{o}_t) \approx \hat{f}_{RBSIS} = \sum_{i=1}^{N} h^{(i)} w^{(i)}$
    End

Figure 14: RB-SIS for policy recognition

Figure 15: Sampling the Rao-Blackwellising variables in AHMM
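
The final combination step can be sketched as follows. This is a minimal illustration under the assumption, standard for importance sampling but not spelled out in the garbled figure, that the weights are normalised to sum to one before being combined; the function name and the dictionary encoding of each $h^{(i)}$ are hypothetical.

```python
def rbsis_estimate(weights, h_values):
    """Combine per-sample marginals into one estimate of Pr(policy | observations).

    weights:  unnormalised importance weights, one per sample
    h_values: one dict {policy: probability} per sample, i.e. the marginal
              read off that sample's belief chain (encoding assumed)
    """
    total = sum(weights)
    estimate = {}
    for w, h in zip(weights, h_values):
        for policy, p in h.items():
            # Each sample contributes its exact marginal, scaled by its weight.
            estimate[policy] = estimate.get(policy, 0.0) + (w / total) * p
    return estimate
```

For instance, two samples with weights 1 and 3 and marginals {north: 1.0} and {north: 0.5, south: 0.5} give the estimate {north: 0.625, south: 0.375}. Because each sample contributes a whole distribution rather than a single sampled policy, the estimate remains a proper distribution.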

Some details on how we can obtain the new samples at each time step are worth noting
here. Since we are using the optimal sampling distribution $q_t = \mathcal{B}_t(s_t, l_t \mid o_t)$ to sample the
RB variables $s_t$ and $l_t$, we need to perform the evidence reversal step (see footnote 7). This can be done
by positioning the root of the belief chain $C_t$ at $s_t$ and reversing the link from $s_t$ to $o_t$. This
gives us the network structure for $\mathcal{B}_t(s_t, l_t, \pi_t^{all} \mid o_t)$, which is exactly the same as $\mathcal{B}_t$
(see Fig. 15), except that the evidence $o_t$ has been absorbed into the marginal distribution
of $s_t$. The weight $w_t = \mathcal{B}_t(o_t)$ can also be obtained as a by-product of this evidence reversal
step. In order to sample $s_t$ and $l_t$ from $\mathcal{B}_t(\cdot \mid o_t)$ without the need to compute the marginal
distribution for these two variables, we can use forward sampling to sample every variable
of $\mathcal{B}_t(\cdot \mid o_t)$, starting from the root node $s_t$ and proceeding upward. Since $l_t$ by definition is the
highest level of policy termination, the sampling can stop at the first level $k$ where $e_t^k = F$.
We can then assign $l_t$ the value $k - 1$. Any unnecessary samples for the policy nodes along
the way are discarded. Once we have the new samples for $s_t$ and $l_t$, the updating of the
RB belief state from $\mathcal{B}_t$ to $\mathcal{B}_{t+1}$ is identical to the belief state updating procedure described
in 5.1. The $h$ function can then be obtained by computing the corresponding marginal of
the new belief chain $C_{t+1}$.
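
The early-stopping rule for sampling $l_t$ can be sketched in isolation. The sketch below is a simplification: it assumes the per-level termination probabilities are supplied as a plain list (a hypothetical input), whereas in the procedure described above they depend on the policy and state sampled at each level.

```python
import random


def sample_highest_termination(term_prob, rng):
    """Sample l_t by walking up the hierarchy and stopping at the first e^k = F.

    term_prob[k] is the probability that the level-k policy terminates given
    that level k-1 has terminated (hypothetical, pre-computed inputs).
    term_prob[0] is 1.0 (primitive action) and term_prob[-1] is 0.0 (top level).
    """
    k = 1
    # rng.random() lies in [0, 1), so probability 0.0 always stops the climb
    # and probability 1.0 always continues it.
    while k < len(term_prob) and rng.random() < term_prob[k]:
        k += 1
    # Level k is the first one that did not terminate, so l_t = k - 1.
    return k - 1
```

A call such as `sample_highest_termination([1.0, 0.6, 0.1, 0.0], random.Random(0))` returns a value between 0 and 2; levels above the stopping point are never visited, which is why the per-sample cost is proportional to $l_t$ rather than to $K$.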

At each time step, the complexity of maintaining a sample (sampling the new RB
variables and updating the RB belief state) is again $O(l_t)$, and thus, on average, bounded
by a constant. The overall complexity of maintaining every sample is thus $O(N)$ on average.
If a prediction is needed, for each sample, we have to compute $h$ by manipulating the chain
$C_{t+1}$ with complexity $O(K)$. Thus the complexity at the time step when a prediction
needs to be made is $O(NK)$.

In comparison with the use of an SIS method such as LW-ER, RB-SIS has the
same order of computational complexity (SIS also has complexity $O(NK)$). However,
while the SIS method needs to sample every layer of the AHMM, the RB-SIS method only
needs to sample two sequences of variables $s_t$, $l_t$, and avoids having to sample the $K$ policy
sequences $\{\pi_t^k\}$. After Rao-Blackwellisation, the dimension of the sample space becomes
much smaller and, more importantly, does not grow with $K$. As a result, the accuracy of
the approximation by the RB-SIS method does not depend on the height of the hierarchy
$K$. In contrast, due to the problems of sampling in high dimensional space, the accuracy of
SIS methods tends to degrade, especially when $K$ is large.

5.2.2 Performing Evidence Reversal with a Factored State Space 

In many cases, the state space $S$ is the Cartesian product of many state variables
representing relatively independent properties of a state: $s_t = (s_t^1, s_t^2, \ldots, s_t^M)$. Since the overall
state space is very large, specifying an action by the usual transition probability matrix is
problematic. It is advantageous in this case to represent the state information in a factored
form, i.e., representing each state variable $s_t^m$ in a separate node rather than lumping them
into a single node $s_t$. It has been shown that using factored representations, we can specify
the transition probability of each action in a compact form since an action is likely to affect
only a small number of state variables and the specification of the effects of actions has
many regularities (Boutilier et al., 2000).



7. The term evidence reversal is used in this paper to refer to a general procedure in which the link to 
the observation node is reversed prior to sampling (Kanazawa et al., 1995), thus allowing us to sample 
according to the optimal sampling distribution qt. 






The representation of the belief chain $C_t$ and also the RB belief state $\mathcal{B}_t$ can take direct
advantage of this factored representation of actions. Indeed, the chain parameter $C_t(s_t \mid \pi_t^0)$
of the link from $\pi_t^0$ to $s_t$ is precisely the transition probability for the action $\pi_t^0$ at the
previous state $s_{t-1}$ (note that $s_{t-1}$ is known due to Rao-Blackwellisation). This conditional
distribution can be extracted from the compact factored representation of $\pi_t^0$ in the general
form of a Bayesian network of the variables $\{s_t^1, s_t^2, \ldots, s_t^M\}$. For our convenience, let us
denote this Bayesian network by $\mathcal{F}_t(\cdot \mid \pi_t^0)$. This network is usually sparse enough so that
exact inference can operate efficiently. For example, in the special case where $\{s_t^1, s_t^2, \ldots, s_t^M\}$
are independent given $\pi_t^0$ and $s_{t-1}$, $\mathcal{F}_t$ will be factored completely into the product of the $M$
marginals of the individual state variables $s_t^m$.

Although factored representations can be used as part of the RB belief state, care must
be taken when performing evidence reversal, i.e., when reversing the link from the state variable
to the observation node. In the procedure for evidence reversal discussed previously (see
Fig. 15), we first position the root of $C_t$ at the node $s_t$, and thus need to compute and represent
the distribution $\Pr(s_t)$. In the factored state space case, this becomes a joint distribution
of all the state variables $\{s_t^1, s_t^2, \ldots, s_t^M\}$. Without conditioning on the current action $\pi_t^0$,
the factored representation of the state variables $\{s_t^m\}$ cannot be utilised, thus resulting in
complexity exponential in $M$.

The key to getting around this difficulty is to always keep the specification of the distribution
of the current state conditioned on the current action, not vice versa. Thus, when computing
$\mathcal{B}_t(\cdot \mid o_t)$, we first position the root of the chain $C_t$ at $\pi_t^0$, and then reverse the evidence
from $o_t$ to both $\pi_t^0$ and $s_t$. In algebraic form, we use the following factorisation of the joint
distribution of the current action and state given the current observation:

$\Pr(\pi_t^0, s_t \mid o_t) = \Pr(s_t \mid \pi_t^0, o_t) \Pr(\pi_t^0 \mid o_t)$ (13)

Fig. 16 illustrates this evidence reversal procedure. In the model depicted here, $\mathcal{F}_t$ can
be an arbitrary Bayesian network. The observation model can be specified by attaching the
observation nodes $\{o_t^1, o_t^2, \ldots\}$ to the state variables. The overall network representing the
distribution $\Pr(s_t, o_t \mid \pi_t^0)$ will be denoted by $\mathcal{F}_t^{+o}(\cdot \mid \pi_t^0)$.

We first look at the first term in the RHS of (13). Let $\mathcal{F}_t^{|o}(\cdot \mid \pi_t^0, o_t)$ represent the
distribution $\Pr(s_t \mid \pi_t^0, o_t)$. Note that $\mathcal{F}_t^{|o}$ can be obtained by conditioning $\mathcal{F}_t^{+o}(\cdot \mid \pi_t^0)$ on the
observation $o_t$. This can be achieved by applying an exact inference method such as the
clustering algorithm (Lauritzen & Spiegelhalter, 1988) on the network $\mathcal{F}_t^{+o}(\cdot \mid \pi_t^0)$.

For the second term in the RHS of (13), we note that:

$\Pr(o_t \mid \pi_t^0) = \sum_{s_t} \Pr(s_t, o_t \mid \pi_t^0) = \sum_{s_t} \mathcal{F}_t^{+o}(s_t, o_t \mid \pi_t^0)$

This summation can be readily obtained as a by-product when performing the above
clustering algorithm on $\mathcal{F}_t^{+o}(\cdot \mid \pi_t^0)$. Once $\Pr(o_t \mid \pi_t^0)$ is known, we can compute $\Pr(\pi_t^0 \mid o_t)$ by:

$\Pr(\pi_t^0 \mid o_t) \propto \Pr(o_t \mid \pi_t^0) \Pr(\pi_t^0)$

This shows that the belief state after evidence reversal = Bi{. \ oi) still has a simple 
structure that exploits the independence relationships between the state variables {s™} 
given the current action tt^. Sampling the RB variables from this structure can proceed as 
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Belief state before evidence reversal Belief state after evidence reversal 



Figure 16: Evidence reversal with factored state space 



follows: Pr(7r° | Of) is first used to sample tt^; J^^'^ {stln^ , ot) is then used to sample s^. Once 
we have obtained the sample for tt^ and Sf, we can proceed to sample the remaining nodes 
in the network to obtain a sample for It as usual. Finally, we note that the weight 
wt = Pr(ot) can also be computed efficiently by: 

PrK) = ^PrK|7r?)Pr(7rO) 

In this evidence reversal procedure, for each value of $\pi_t^0$, we need to perform exact
inference on the structure of $\mathcal{F}(s_t, o_t \mid \pi_t^0)$. Thus the complexity of this procedure depends
heavily on the complexity of the network structure of $\mathcal{F}$. However, as we have noted,
due to the nature of the factored representation, $\mathcal{F}$ usually has a sparse structure so that
exact inference can be performed efficiently. For example, in the special case where $\mathcal{F}$
is completely factored into the product of $M$ independent state variables which are then
independently observed, the complexity becomes linear w.r.t. $M$.
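
The fully factored special case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the state is taken to consist of M discrete variables, each observed independently, and the names `prior`, `F`, `obs` and the toy numbers are hypothetical rather than taken from the paper's implementation.

```python
import random

def evidence_reversal(prior, F, obs, o, rng):
    """One evidence-reversal step for a fully factored state (illustrative sketch).

    prior[a]     : Pr(pi = a), prior over the current action
    F[a][m][v]   : Pr(s^m = v | pi = a), factored state distribution
    obs[m][v][k] : Pr(o^m = k | s^m = v), per-variable observation model
    o[m]         : observed value for variable m
    Returns (sampled action, sampled state, weight Pr(o), posterior over actions).
    """
    n_actions, n_vars = len(prior), len(o)
    # joint[a][m][v] = Pr(s^m = v, o^m | pi = a)
    joint = [[[F[a][m][v] * obs[m][v][o[m]] for v in range(len(F[a][m]))]
              for m in range(n_vars)]
             for a in range(n_actions)]
    # Pr(o | pi = a) is a product of M per-variable sums, so the cost is linear in M.
    obs_given_action = []
    for a in range(n_actions):
        p = 1.0
        for m in range(n_vars):
            p *= sum(joint[a][m])
        obs_given_action.append(p)
    # Weight w = Pr(o) = sum_a Pr(o | a) Pr(a).
    weight = sum(l * p for l, p in zip(obs_given_action, prior))
    # Pr(a | o) is proportional to Pr(o | a) Pr(a).
    posterior = [l * p / weight for l, p in zip(obs_given_action, prior)]
    # Sample the action first, then each state variable given action and evidence.
    a = rng.choices(range(n_actions), weights=posterior)[0]
    s = [rng.choices(range(len(joint[a][m])), weights=joint[a][m])[0]
         for m in range(n_vars)]
    return a, s, weight, posterior

# Toy inputs (hypothetical): 2 actions, M = 2 binary variables, binary observations.
PRIOR = [0.5, 0.5]
F = [[[0.9, 0.1], [0.8, 0.2]],
     [[0.2, 0.8], [0.3, 0.7]]]
OBS = [[[0.9, 0.1], [0.1, 0.9]],
       [[0.9, 0.1], [0.1, 0.9]]]
action, state, weight, posterior = evidence_reversal(
    PRIOR, F, OBS, [0, 0], random.Random(0))
```

With a sparse but not fully factored network, the per-variable sums would be replaced by a call to an exact inference routine on the network, which is where the dependence on the network structure enters.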

6. Experimental Results 

In this section, we present our experimental results with the policy recognition algorithm. In 
subsection 6.1, we demonstrate the effectiveness of the Rao-Blackwellised sampling method 
for policy recognition by comparing the performance of our Rao-Blackwellised procedure 
against likelihood weighting sampling in a synthetic tracking task. In subsection 6.2, we 
present an application of the AHMM framework to the problem of tracking human be- 
haviours in a complex spatial environment using distributed video surveillance data. 
Figure 17: The environment and a sample trajectory

Figure 18: Probabilities of top-level destinations over time (vertical axis: destination probabilities; horizontal axis: time; curves: west, south, east, north)



6.1 Effectiveness of Rao-Blackwellisation 

To demonstrate the effectiveness of the Rao-Blackwellised inference method for AHMM, we 
again consider the synthetic tracking task in which it is required to monitor and predict 
the movement of an agent through the building environment previously discussed in sub- 
section 3.3. The structure of the AHMM used is the same as the one shown in Fig. 5. The 
parameters of the policies are chosen manually, and then used to simulate the movement of 
the agent in the building. To simulate the observation noise, we assume that the observa- 
tion of the agent's true position can be anywhere among its 8 neighbouring cells with the 
probabilities given by a predefined observation model. 
Figure 19: Performance profiles of SIS vs. RB

Figure 20: Efficiency coefficients of SIS and RB-SIS



We implement the RB-SIS method (with re-sampling) and use the policy hierarchy 
specification and the simulated observation sequence as input to the algorithm. In a typical 
run, the algorithm can return the probability of the main building exit, the next wing exit, 
and the next room-door that the agent is currently heading to. An example track is shown 
in Fig. 17. As the observations about the track arrive over time, the prediction probability 
distribution of which main building exit the track is heading to is shown in Fig. 18. 

To illustrate the advantage of RB-SIS, we also implement an SIS method without Rao- 
Blackwellisation (LW with ER and re-sampling (Kanazawa et al., 1995)) and compare the 
performance of the two algorithms. We run the two algorithms using different sample pop- 
ulation sizes to obtain their performance profiles. For a given sample size N, the standard 
deviation ($\sigma(N)$) over 50 runs in the estimated probabilities of the top-level policies is used
as the measure of expected error in the probability estimates. We also record the average 
time taken in each update iteration. 

Fig. 19(a) plots the standard deviation of the two algorithms for different sample sizes. 
The behaviour of the error closely follows the theoretical curve $\sigma(N) = c/\sqrt{N}$, or $\sigma^2(N) =
c^2/N$, with $c_{SIS} \approx 0.26$ and $c_{RB\text{-}SIS} \approx 0.055$. As expected, for the same number of samples,
the RB-SIS algorithm delivers much better accuracy. 
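
To make the error curve concrete, the following sketch works with the two constants reported above. The helper names (`std_error`, `fit_c`, `samples_needed`) are illustrative assumptions, and the closed-form least-squares fit is one plausible way to obtain $c$ from measured points, not necessarily the procedure used for Fig. 19(a).

```python
import math

# Constants reported in the text for the synthetic tracking task.
C_SIS = 0.26
C_RB_SIS = 0.055

def std_error(c, n_samples):
    """Theoretical error curve sigma(N) = c / sqrt(N)."""
    return c / math.sqrt(n_samples)

def fit_c(sample_sizes, std_devs):
    """Least-squares fit of c in sigma = c * N**-0.5 (closed form)."""
    xs = [n ** -0.5 for n in sample_sizes]
    return sum(x * s for x, s in zip(xs, std_devs)) / sum(x * x for x in xs)

def samples_needed(c, target_std):
    """A sample size N (up to rounding) for which c / sqrt(N) <= target_std."""
    return math.ceil((c / target_std) ** 2)

# For equal error, SIS needs (c_SIS / c_RB-SIS)^2 times as many samples as RB-SIS.
sample_ratio = (C_SIS / C_RB_SIS) ** 2
```

Under the fitted curve, `sample_ratio` is roughly 22, i.e. SIS would need about 22 times as many samples as RB-SIS to reach the same standard deviation.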

Fig. 19(b) plots the average CPU time ($T$) taken in each iteration versus the sample
size. As expected, $T(N)$ is linear in $N$, with RB-SIS taking about twice as long due to
the overhead in updating the RB belief state while processing each sample.

Fig. 19(c) plots the actual CPU time taken versus the expected error for the two algo- 
rithms. It shows that for the same CPU time spent, the RB-SIS method still significantly 
reduces the error in the probability estimates. 

Note that for each algorithm, the quantity $\eta = \sigma^2(N)\,T(N)$ is approximately constant
since the dependence on $N$ cancels out. Thus, this constant can be used as an
efficiency coefficient to measure the performance of the sampling algorithm independent of
the number of samples. For example, if an algorithm has a coefficient half as large, it can
deliver the same accuracy in half the CPU time, or half the variance for the same CPU time.
Fig. 20 plots the efficiency coefficients for both SIS and RB-SIS, with $\eta_{SIS} \approx 0.0018$ and
$\eta_{RB\text{-}SIS} \approx 0.000235$. This indicates a performance gain of almost an order of magnitude
(about 8-fold) for RB-SIS.
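
The role of the efficiency coefficient can be checked with a short calculation. The sketch below uses the two coefficients reported above; the function names and the per-sample cost `tau` in the usage note are hypothetical, and CPU time is in whatever unit $T$ was measured in.

```python
# Efficiency coefficients reported in the text for the synthetic tracking task.
ETA_SIS = 0.0018
ETA_RB_SIS = 0.000235

def efficiency(std_dev, time_per_iteration):
    """eta = sigma^2(N) * T(N); constant in N when sigma ~ c/sqrt(N) and T ~ tau*N."""
    return std_dev ** 2 * time_per_iteration

def time_for_target(eta, target_std):
    """CPU time per iteration needed to reach a target standard deviation."""
    return eta / target_std ** 2

# Ratio of coefficients: how much less CPU time RB-SIS needs for equal variance.
gain = ETA_SIS / ETA_RB_SIS
```

For instance, with a hypothetical `c = 0.26` and `tau = 0.01`, `efficiency(c / N**0.5, tau * N)` gives the same value for any `N`, and `gain` evaluates to roughly 7.7, consistent with the "about 8-fold" figure.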

6.2 Application to Tracking Human Behaviours 

Using the policy recognition algorithm, we have implemented a real-time surveillance system 
that tracks the behaviour of people in a complex indoor environment using surveillance video 
data. The environment consists of a corridor, the Vision lab and two offices (see Fig. 21). 
People enter/exit the scene via the left or the right entrance of the corridor. The system 
has six static cameras with overlapping fields of view which cover most of the ground plane
in the scene. 

The entire environment is divided into a grid of cells, and the current cell position of 
the tracked object acts like the current state in our AHMM. The cameras are calibrated so 
that they can return the current position of the tracked object on the ground; however, the
returned coordinates are unreliable as the cameras have to deal with noisy video frames and
occlusion of objects in the scene. For more information on how low-level tracking is done
with multiple cameras, readers are referred to Nguyen, Venkatesh, West, & Bui (2002).
We assume that the observation of a state can only be in the area surrounding it; thus the
observation model is a matrix specifying the observation likelihood for each cell within a 
neighbourhood of the current state. 
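
A neighbourhood observation model of this kind can be sketched as a small matrix indexed by the offset between the observed cell and the true cell. The 3x3 size and the kernel values below are hypothetical placeholders (the paper does not list the actual likelihoods), and grid boundaries are ignored for simplicity.

```python
import random

# Hypothetical 3x3 kernel: KERNEL[dy + 1][dx + 1] = Pr(observed offset (dx, dy) | true cell).
KERNEL = [[0.05, 0.05, 0.05],
          [0.05, 0.60, 0.05],
          [0.05, 0.05, 0.05]]

def obs_likelihood(state, observation, kernel=KERNEL):
    """Pr(observation | state); zero outside the neighbourhood covered by the kernel."""
    radius = len(kernel) // 2
    dx = observation[0] - state[0]
    dy = observation[1] - state[1]
    if abs(dx) > radius or abs(dy) > radius:
        return 0.0
    return kernel[dy + radius][dx + radius]

def sample_observation(state, rng, kernel=KERNEL):
    """Draw a noisy observed cell around the true cell according to the kernel."""
    radius = len(kernel) // 2
    offsets = [(dx, dy)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)]
    weights = [kernel[dy + radius][dx + radius] for dx, dy in offsets]
    dx, dy = rng.choices(offsets, weights=weights)[0]
    return (state[0] + dx, state[1] + dy)
```

A larger neighbourhood only requires a larger kernel; the two functions read the radius from the kernel size.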

The policy hierarchy for behaviours in this environment is constructed as follows. First, 
we construct the region hierarchy with three levels. At the bottom level, we identify 7 
regions of special interest: the corridor, the two offices, the areas surrounding the Linux 
server, NT server, printer, and the remaining free space in the Vision lab (Fig. 21). At the 
higher level, all regions in the Vision lab are grouped together. The top level consists of the 
entire environment. The policy hierarchy representing people's behaviours has three levels
corresponding to the three levels of the region hierarchy (see Fig. 23). At the bottom level, 
we are interested in the behaviours that take place within each of the 7 regions of interest. 
For example, near the Linux server, the person might be using the Linux machine, or simply 
passing through that region, leading to two different policies. Similar policies are defined 
for the NT server region, the printer region, and the two small offices. In the corridor 
and inside the Vision lab (regions 1 and 5), we construct different policies corresponding to
the different destinations that the person is heading to. Region 5 also has a special policy 
representing the "walk-around" behaviour. At the middle level, three policies are defined 
for the corridor and office space representing a person's plan of exiting this space by the 
left/right entrance or by the door of the Vision lab. We define only one policy for the Vision 
lab to represent the typical behaviour of a lab user (e.g., go to Linux server, followed by 
go to printer).⁸ Finally, for the top level region (the whole environment), we define two
policies representing a person's leaving the scene via the left/right entrance. 

Fig. 21 and 22 show two concurrent trajectories of two different people in this environ- 
ment. Some sample video frames captured by the different cameras in the system are shown 
in Fig. 24. 

With the AHMM model defined above, and a sequence of observations returned by
the cameras, we first determine the performance profiles of RB-SIS and SIS in this real
environment. The two algorithms behave in a similar way as in the previous experiment
with simulated data. Fig. 25 shows the error curve against the CPU time for the two
algorithms. The efficiency coefficient for RB-SIS in this case is $\eta_{RB\text{-}SIS} \approx 0.011$, and for
SIS is $\eta_{SIS} \approx 0.06$. This shows that RB-SIS still performs about 5 times better than
SIS in this domain.

8. If we consider different groups of lab users, each group might give rise to a different policy at this level.

Figure 21: The environment and the trajectory of person 1

Figure 22: The trajectory of person 2

Figure 25: Performance of RB-SIS and SIS with real tracking data

Figure 26: The probabilities that person 1 is leaving the scene via the entrances (top level policies)

In the surveillance system, the low level tracking module returns the observations at 
the rate of approximately two per second. The observation is then passed to the RB-SIS 
algorithm which produces the probability estimate of the current policy at different levels 
in the hierarchy. At the moment, our surveillance system can run in real time using two 
AMD 1G machines. Examples of the output returned by the system for the two trajectories
in Fig. 21 and Fig. 22 are given below. 

Fig. 26 shows the probabilities that person 1 is exiting the environment by the left
or right entrance (denoted by $P_{left\_e}$ and $P_{right\_e}$ respectively). At the beginning, $P_{left\_e}$
increases when person 1 is heading to the left entrance (see the trajectory in Fig. 21). Then,
$P_{left\_e}$ is approximately constant from time slice 50 when person 1 is inside the Vision lab.
This is because only one middle level policy is defined for the Vision lab and his movement
inside the lab is independent of his final exit/entrance. At time slice 310, $P_{left\_e}$ decreases
when person 1 is leaving the lab, turning right, and entering office 2. Then, it increases and
approaches 1 when he is leaving office 2, turning left, and going towards the left entrance.
In contrast, $P_{right\_e}$ falls quickly to zero during this time.

We now look at the results of querying the bottom level policies. Fig. 27 shows the
distribution of the possible destinations of person 2 from time slice 180 to time slice 260,
when he is in region 5 (see the trajectory in Fig. 22). The probabilities obtained show that
the system is able to correctly detect the "walk-around" behaviour. 

The final result (Fig. 28) shows the inferred behaviours of person 1 when he is at the 
Linux server region. Initially, the probabilities for "using Linux server" and for "passing 
through" are the same. As the person stays in the same position for an extended period of 
time, the system is able to identify the correct behaviour of person 1 as "using the Linux
server".
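
The qualitative effect described here, where evidence of staying in place gradually separates two initially equiprobable behaviours, can be illustrated with a two-hypothesis Bayes update. This is a toy calculation and not the AHMM inference itself; the per-step "stay" probabilities are hypothetical values chosen only for illustration.

```python
def posterior_using(n_stationary_steps, prior_using=0.5,
                    stay_if_using=0.9, stay_if_passing=0.3):
    """Posterior of 'using' after n consecutive observations of staying in place.

    stay_if_using / stay_if_passing are assumed per-step probabilities of
    remaining in the same cell under each behaviour (hypothetical values).
    """
    like_using = stay_if_using ** n_stationary_steps
    like_passing = stay_if_passing ** n_stationary_steps
    numerator = prior_using * like_using
    return numerator / (numerator + (1.0 - prior_using) * like_passing)

# Posterior after 0..5 stationary observations; starts at the prior and rises.
trace = [posterior_using(n) for n in range(6)]
```

With these assumed values the posterior starts at 0.5 and exceeds 0.99 after five stationary observations, mirroring the shape of the curve described for Fig. 28.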
Figure 27: Behaviours of person 2 inside the Vision lab

Figure 28: Behaviour of person 1 inside the Linux server region

7. Related Work in Probabilistic Plan Recognition

The case for using probabilistic inference for plan recognition has been argued convincingly 
by Charniak and Goldman (1993). However, the plan recognition Bayesian network used 
by Charniak and Goldman is a static network. Thus their approach would run into prob- 
lems when they have to process on-line a stream of evidence about the plan. More recent 
approaches (Pynadath & Wellman, 1995, 2000; Goldman et al., 1999; Huber et al., 1994;
Albrecht et al., 1998) have used dynamic stochastic models for plan recognition and thus 
are more suitable for doing on-line plan recognition under uncertainty. 

Among these, the most closely related model to the AHMM is the Probabilistic State-Dependent
Grammar (PSDG) (Pynadath, 1999; Pynadath & Wellman, 2000). A comparison
of the representational aspects of the two models is given in subsection 3.4.
In terms of algorithms for plan recognition, Pynadath and Wellman only offer an exact 
method to deal with the case where the states are fully observable. When the states are 
partially observable, a brute-force approach is suggested which amounts to summing over all 
possible states. We note that even for the fully observable case, the belief state that we need 
to deal with can still be large since the policy starting/ending times are unknown.⁹ Since
an exact method is used by Pynadath and Wellman, the complexity for maintaining the
belief state would most likely be exponential in the number of levels in the PSDG expansion
hierarchy (i.e., the height of our policy hierarchy). On the other hand, our RB-SIS policy 
recognition algorithm can handle partially observable states and the Rao-Blackwellisation 
procedure ensures that the sampling algorithm scales well with the number of levels in the 
policy hierarchy. Furthermore, as we have noted in subsection 3.4, if we consider compound 
policies, the PSDG can be converted to an AHMM. In our framework, a compound policy 
$\pi = \pi_{(1)}, \ldots, \pi_{(m)}$ can be represented just as a normal policy, with a slight modification
to let the variable take on values between 1 and m + 1, where the value m + 1 indicates
that the compound policy has terminated. The policy recognition algorithm can then be 
modified to also work with this model. 

Similar to our AHMM and the PSDG, the recent work by Goldman et al. (1999) also 
makes use of a detailed model of the plan execution process. Using the rich language of 
probabilistic Horn abduction, they are able to model more sophisticated plan structures 
such as interleaved/concurrent plans and partially-ordered plans. However, the work serves
mainly as a representational framework, and provides no analysis of the complexity of plan
recognition in this setting.

Other work in probabilistic plan recognition to date has employed much coarser
models for plan execution. Most have ignored the important influence of the state of the
world on the agent's planning decisions (Goldman et al., 1999). To the best of our knowledge,
none of the work to date has addressed the problem of partial and noisy observation
of the state. Most, except the PSDG, do not look at the observation of the outcomes of
actions, and assume that the action can be observed directly and accurately. We note that
simplifying assumptions of this kind are needed in previous work so that the computational
complexity of performing probabilistic plan recognition remains manageable. In contrast,
our work here illustrates that although dynamic stochastic models for plan recognition can
be complex, they exhibit special types of conditional independence which, if exploited, can
lead to efficient plan recognition algorithms.

9. Of course, if an SRD policy hierarchy is considered then full observability alone is enough.

8. Conclusion and Future Work 

In summary, we have presented an approach for on-line plan recognition under uncertainty 
using the AHMM as the model for the execution of a stochastic plan hierarchy and its noisy 
observation. The AHMM is a novel type of stochastic process, capable of representing
a rich class of plans and the associated uncertainty in the planning and plan observation
process. We first analyse the AHMM structure and its conditional independence proper- 
ties. This leads to the proposed hybrid Rao-Blackwellised Sequential Importance Sampling 
(RB-SIS) algorithm for performing belief state updating (filtering) for the AHMM which 
exploits the structure of the AHMM for greater efficiency and scalability. We show that the 
complexity of RB-SIS when applied to the AHMM only depends linearly on the number of 
levels K in the policy hierarchy, while the sampling error does not depend on K. 

In terms of plan recognition, these results show that while the stochastic process for
representing the execution of a plan hierarchy can be complex, it exhibits certain conditional
independence properties that are inherent in the dynamics of the planning and acting
process. These independence properties, if exploited, can help to reduce the complexity of 
performing inference on the plan execution stochastic model, leading to feasible and scalable 
algorithms for on-line plan recognition in noisy and uncertain domains. The scalability of 
the algorithm for policy recognition provides the possibility to consider more complex plan 
hierarchies and more detailed models of the plan execution process. The key to achieving this
efficiency, as we have shown in the paper, is a combination of recently developed techniques 
in probabilistic inference: compact representations for Bayesian networks (context-sensitive 
independence, factored representations), and hybrid DBN inference which can take advan- 
tage of these compact representations (Rao-Blackwellisation). 

Several future research directions are possible. To further investigate the AHMM, we 
would like to consider the problem of learning the parameters of an AHMM from a database 
of observation sequences, e.g., to learn the plan execution model by observing multiple 
episodes of an agent executing the same plan. The structure of the AHMM suggests that 
we can try to learn the model of each abstract policy separately. Indeed, if we can observe 
the execution of each abstract policy separately, the learning problem is reduced to HMM 
parameter re-estimation for level-1 policies, and simple frequency counting for higher-level 
policies. If the observation sequence is a long episode with no clear-cut temporal boundary
between the policies, the problem becomes a type of parameter estimation for DBNs with
hidden variables, and techniques for dealing with hidden variables such as EM (Dempster,
Laird, & Rubin, 1977) can be applied.
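
For the separately observed case, the frequency-counting step for a higher-level policy might look like the sketch below. It assumes, purely for illustration, that each episode records the state at which the abstract policy selected a sub-policy together with the sub-policy chosen; the data format, the state and policy labels, and the function name are all hypothetical.

```python
from collections import Counter, defaultdict

def estimate_selection_probs(episodes):
    """Maximum-likelihood estimate of Pr(sub-policy | state) by frequency counting.

    episodes: iterable of episodes, each a list of (state, chosen_sub_policy)
    pairs observed while one abstract policy was executing (assumed format).
    """
    counts = defaultdict(Counter)
    for episode in episodes:
        for state, sub_policy in episode:
            counts[state][sub_policy] += 1
    return {
        state: {p: c / sum(ctr.values()) for p, c in ctr.items()}
        for state, ctr in counts.items()
    }

# Hypothetical episodes of a single lab-user policy.
EPISODES = [
    [("door", "to_printer"), ("lab", "to_linux")],
    [("door", "to_printer")],
    [("door", "to_linux")],
]
probs = estimate_selection_probs(EPISODES)
```

Termination probabilities could be estimated with the same counting scheme. When policy boundaries are not observed, these counts would be replaced by expected counts inside an EM iteration.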

Extensions can be made to the AHMM to make the model more expressive and suitable 
for representing more complex agents' plans. For example, a more expressive plan execution 
model such as the HAM model (Parr, 1998) can be considered so that state-independent 
sequences of policies can be represented. The current model can also be enriched to consider 
a set of top-level policies which can be interleaved during their execution. We expect that 
these new models would exhibit context-specific independence properties similar to the 
AHMM, and Rao-Blackwellised sampling methods for policy recognition in these models
can be derived.